Coconut (Cocos nucifera L.), a member of the palm family (Arecaceae), is one of the most economically important crops in the tropics, serving as an important source of food, drink, fuel, medicine, and construction material. Here we report an assembly of the coconut (C. nucifera, Oman local Tall cultivar) mitochondrial (mt) genome based on next-generation sequencing data. This genome, 678,653bp in length and 45.5% in GC content, encodes 72 proteins, 9 pseudogenes, 23 tRNAs, and 3 ribosomal RNAs. Within the assembly, we find that the chloroplast (cp) derived regions account for 5.07% of the total assembly length, including 13 proteins, 2 pseudogenes, and 11 tRNAs. The mt genome has a relatively large fraction of repeat content (17.26%), including both forward (tandem) and inverted (palindromic) repeats. Sequence variation analysis shows that the Ti/Tv ratio of the mt genome is lower than that of the nuclear genome and the neutral expectation. By combining public RNA-Seq data for coconut, we identify 734 RNA editing sites supported by at least two datasets. In summary, our data provide the second complete mt genome sequence in the family Arecaceae, essential for further investigations on mitochondrial biology of seed plants.
Citation: Aljohi HA, Liu W, Lin Q, Zhao Y, Zeng J, Alamer A, et al. (2016) Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome. PLoS ONE 11(10): e0163990. https://doi.org/10.1371/journal.pone.0163990
Editor: Hector Candela, Universidad Miguel Hernández de Elche, SPAIN
Received: April 19, 2016; Accepted: September 19, 2016; Published: October 13, 2016
Copyright: © 2016 Aljohi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All Roche/454 data were deposited at BIGD database (http://gsa.big.ac.cn, CRX007340 and CRX007339). HiSeq data used for coconut mt genome correction was deposited at BIGD database (CRX007360, CRX007361 and CRX007362). The complete coconut mt genome sequence was deposited to GenBank (accession number KX028885). The RNA-seq data was downloaded from SRA (SRR1063404, SRR1063407, SRR1137438, SRR1173229, SRR1265939, SRR1273070, SRR1273180, SRR606452).
Funding: The work is supported by KACST grant 1035-35 to Hasan from the Joint Center for Genomics Research (JCGR), King Abdulaziz City for Science and Technology (KACST), Kingdom of Saudi Arabia and the grants from the National Natural Science Foundation of China (31200957 to Qiang and 31501042 to Wanfei).
Competing interests: The authors have declared that no competing interests exist.
The plant mitochondrial (mt) genome is considered a remnant of an ancestral α-proteobacterium that was a symbiont in its eukaryotic common ancestor. It is involved in cellular energy production by respiration and in the regulation of various cellular functions, such as homeostasis, apoptosis, and metabolite biosynthesis. Since the first mt genome of a land plant was published (Marchantia polymorpha, liverwort), 303 mt genomes had become available in the NCBI organelle database as of December 9, 2015. Plant mt genomes have several characteristics that make them important for evolutionary studies. First, plant mt gene contents are highly variable across plant taxa, with genes acquired from both plastid and nuclear genomes through intracellular transfer [6–8], as well as from other species via horizontal transfer [9, 10]; plant mt genes can also be transferred to their nuclear genomes. Second, plant mt genomes evolve more rapidly in their structure, but more slowly in their primary sequence [12, 13], as compared to both the chloroplast (cp) and nuclear counterparts. The genome-size expansion of plant mt genomes primarily reflects the increase of intronic and intergenic DNA, as plant mt genomes have dramatically lower mutation rates when compared to both their cp and nuclear counterparts [14, 15]. Third, plant mt genomes have a large number of copies per cell and show a remarkable amount of rearrangements. A recent study has also shown that copies of the Silene noctiflora mt genome can be gained or lost, which emphasizes the evolutionary differences among the mt, cp, and nuclear genomes. Fourth, plant mt genomes have a large number of intron-containing genes; some of them need trans-splicing to produce complete transcripts. Fifth, plant mt genomes have a high frequency of RNA editing that contributes to functional conservation for the mt proteins [19, 20].
In plants, RNA editing affects mitochondrial and plastid transcripts by site-specific modification of cytidines-to-uridines and the reverse [20–23]. Taken together, the characteristics of plant mt genomes highlight the difficulty of sequence assembly and analysis. Recently, we released a mt genome assembly of date palm (Phoenix dactylifera) as the first one of the palm family, and now we add another, that of the coconut (Cocos nucifera L.), as the second of the palm family.
C. nucifera, or coconut, is one of the most economically important crops in the tropics, serving as a source of food, drink, fuel, medicine, and construction material. Although the plant has considerable economic value as a crop, there have been only a limited number of studies on its genome. Based on a flow cytometric analysis, the diploid genome of coconut is 5.966 ± 0.111 pg or 5.757 Gb in size, i.e., its haploid counterpart is 2.8785 Gb. Genome sequencing data also support this estimate, showing a genome size of ~2.6 Gb with 50% to 70% repeat content. Recently, several coconut transcriptomic studies have been reported [28–30], providing datasets for de novo transcriptomic assembly and other molecular studies. Coconut cultivars are generally classified into two types, the Tall and the Dwarf. In this study, we present the first coconut mt genome, that of the Oman local Tall variety. We first acquired high-throughput sequences of total cellular DNA using the Roche/454 platform and assembled them into a complete chromosome, and then corrected some of the sequence variations using Illumina HiSeq data. We also analyzed the genome assembly using transcriptomic data for genome structure and functional genes based on various comparative genome analysis tools.
Materials and Methods
Fresh green leaves from an adult coconut plant of the Tall cultivar located at Salalah, Dhofar Governorate, Oman, were collected, washed with double-distilled water, and frozen immediately in a liquid nitrogen container. The farm is owned by one of the co-authors of this work, Dr. Abdullah M. Al-Sadi, who is employed by Sultan Qaboos University and to whom future inquiry should be addressed. This study does not involve endangered or protected species and does not require specific permission from regulatory authority concerning wildlife protection. After being transported to the laboratory, these samples were stored in -80°C freezers until use.
Genomic DNA isolation and sequencing
The raw coconut mt genome sequences were extracted from those produced as part of the Palm Plant Genome Project (a joint effort between KACST and BIG, CAS). Genomic DNA was isolated from 50g fresh leaves according to a CTAB-based method. 5mg purified DNA was used for library construction for both single-read and paired-read libraries with 3kb and 8kb insert sizes according to the manufacturer’s manual for GS FLX Titanium. The libraries were amplified and sequenced on the Roche/454 GS FLX platform. All Roche/454 data were deposited at the BIGD database (http://gsa.big.ac.cn, CRX007340 and CRX007339). The same purified DNA sample was also used for constructing the Illumina HiSeq libraries. HiSeq paired-end (≤500bp) and mate-pair libraries (1kb to 8kb) were constructed using the Illumina Simple Paired-End Library and Mate-Pair Library Preparation Protocol, respectively. The libraries were sequenced on the Illumina HiSeq 2000 platform. HiSeq data used for coconut mt genome correction were deposited at the BIGD database (CRX007360, CRX007361 and CRX007362).
Sequence assembly and validation
We first assembled total reads from 13 single-read datasets and 12 paired-read datasets into 573,893 contigs using Newbler 2.6 (with “-a 0” option and default for others), a de novo sequence assembly software. We then aligned the assembled contigs to 234 published land plant mt genomes downloaded from the NCBI organelle database on September 16, 2015 using BLAST (identity ≥ 80%, E-value ≤ 1e-5 and overlap percent ≥ 30%) [32–34]. We next used 353 annotated contigs (length ranging from 102bp to 49,695bp with a median size of 399bp) to build scaffolds using bb.454contignet, and checked them manually based on contig coverage and spanning reads in the Newbler assemblies. We finally obtained a single scaffold of 678,112bp in length without gaps from 143 overlapping contigs.
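The contig screening step above can be sketched as a filter over tabular BLAST hits. This is a minimal illustration, not the authors' pipeline. It assumes BLAST tabular output (`-outfmt 6` with the default 12 columns), and it assumes that "overlap percent" means the aligned span on the contig divided by the contig length, which the text does not define. The names `Hit`, `parse_outfmt6`, `passes`, and `mt_like_contigs` are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Hit(NamedTuple):
    query: str       # contig identifier
    subject: str     # reference mt genome identifier
    identity: float  # percent identity
    q_start: int     # alignment start on the contig
    q_end: int       # alignment end on the contig
    evalue: float


def parse_outfmt6(lines: Iterable[str]) -> List[Hit]:
    """Parse default 12-column BLAST tabular output into Hit records."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]), float(f[10])))
    return hits


def passes(hit: Hit, contig_len: int, min_identity: float = 80.0,
           max_evalue: float = 1e-5, min_overlap: float = 30.0) -> bool:
    """Apply the three thresholds quoted in the text.

    Assumption: overlap percent = aligned contig span / contig length.
    """
    span = abs(hit.q_end - hit.q_start) + 1
    overlap = 100.0 * span / contig_len
    return (hit.identity >= min_identity
            and hit.evalue <= max_evalue
            and overlap >= min_overlap)


def mt_like_contigs(hits: Iterable[Hit], contig_lengths: Dict[str, int]) -> Set[str]:
    """Return contigs with at least one hit passing all thresholds."""
    return {h.query for h in hits if passes(h, contig_lengths[h.query])}
```

Contigs retained by such a filter would still need the manual coverage and spanning-read checks described above before scaffolding.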
To correct the sequence errors that are characteristic of the Roche/454 platform in the assembly, such as homopolymer errors (typical of pyrosequencing), we used HiSeq paired-end data (180bp insert size) and bowtie2 (version 2.2.4). The consensus sequence was obtained by using samtools (version 1.2) [37, 38] and bcftools (version 1.2). The length became 678,133bp after this correction. As a byproduct, we identified several pseudogenes due to frame-shifts caused by homopolymers. We checked the final assembly manually based on Roche/454 and HiSeq paired-end data using IGV software (version 2.3.61) and revised 687 loci with 528 indels and 159 SNPs [40, 41]. Finally, we obtained a new length of 678,653bp with average sequence depths of ~42x for Roche/454 data and ~1788x for HiSeq data. We checked this assembly using HiSeq mate-pair data with insert sizes of 5kb and 8kb in 5-kb and 8-kb sliding windows, respectively. On average, our final genome assembly was supported by 59.57% and 58.37% of the mate-pair reads from the 5-kb and 8-kb libraries, respectively. The complete mt genome sequence was deposited to GenBank (accession number KX028885).
We aligned our assembly to the mt genes from 18 representative land plants with BLAST (identity ≥ 80% and E-value ≤ 1e-5) and identified ORFs using ORF finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). Introns were depicted by using Rfam (v1.1 with default parameters, http://rfam.xfam.org) (S1 Table) and tRNA genes were identified by using BLAST (v2.2.26+) and tRNAscan-SE (v1.23). All rRNA genes were identified similarly. The cp-derived regions were identified by comparing mt genome with cp genome (GenBank accession number KX028884) based on BLAST (identity ≥ 80%, E-value ≤ 1e-5 and length ≥ 50bp). REPuter and tandem repeat finder were used to identify forward, palindromic, and tandem repeats (https://bibiserv2.cebitec.uni-bielefeld.de/reputer and http://tandem.bu.edu/trf/trf.html) [44, 45].
Sequence variants were identified based on HiSeq paired-end data with 180-bp insert size. The raw reads were mapped to the final mt genome by using bowtie2 (version 2.2.4), and the variants were called by using the RGAAT tool, which was developed in our laboratory (https://sourceforge.net/projects/rgaat/), and by using samtools and bcftools (version 1.2) [37–39]. To eliminate false positives, we only kept the variations identified by both methods. To evaluate the variations between the two palm species, C. nucifera and P. dactylifera, MUMmer3 was used for genome alignment.
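The "kept by both methods" rule amounts to a set intersection over variant calls. The sketch below is illustrative only and assumes that two calls count as the same variant when position, reference allele, and alternative allele all match; the text does not state the matching rule. `Variant`, `shared_variants`, and `kind` are hypothetical names, and the toy calls are invented for the example.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    pos: int  # 1-based position on the mt genome
    ref: str  # reference allele
    alt: str  # alternative allele


def shared_variants(calls_a: Iterable[Variant], calls_b: Iterable[Variant]) -> Set[Variant]:
    """Keep only variants reported by both callers (exact match on pos/ref/alt)."""
    return set(calls_a) & set(calls_b)


def kind(v: Variant) -> str:
    """Classify a variant by allele lengths."""
    if len(v.ref) == 1 and len(v.alt) == 1:
        return "SNP"
    return "insertion" if len(v.alt) > len(v.ref) else "deletion"


# Toy call sets standing in for the two callers' outputs.
caller_one = [Variant(100, "A", "G"), Variant(250, "C", "CT"), Variant(900, "G", "T")]
caller_two = [Variant(100, "A", "G"), Variant(250, "C", "CT"), Variant(777, "T", "A")]

consensus = shared_variants(caller_one, caller_two)
```

Requiring agreement trades sensitivity for specificity: calls made by only one tool are discarded even if they are real.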
RNA editing analysis
We predicted putative RNA editing sites based on 8 public RNA-Seq datasets of coconut palm (SRR1063404, SRR1063407, SRR1137438, SRR1173229, SRR1265939, SRR1273070, SRR1273180, and SRR606452). After filtering the low quality reads and removing the adapter sequences with Trimmomatic (version 0.33), we mapped all high-quality reads to the mt genome using GSNAP (version 2014-12-19) with the options “-N 1 and -force-xs-dir” (all other options are default). The candidate RNA editing loci were filtered through read mapping with the following criteria: (1) there are more than 2 aligned reads for each alternative allele, and (2) the percentage of the alternative allele must be equal to or above 50%. We identified 845 RNA editing sites using the REDO tool (https://sourceforge.net/projects/redo/) and predicted putative RNA editing sites in protein-coding genes using the web-based PREP-mt program with a cutoff score of 0.6 (http://prep.unl.edu/).
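The two read-support criteria can be written as a small predicate. This is a sketch under stated assumptions rather than the REDO implementation: it assumes the alternative-allele percentage is computed over reference plus alternative reads at the site, and it reads "more than 2" as at least 3. The function name and parameters are hypothetical.

```python
def is_candidate_edit(ref_reads: int, alt_reads: int,
                      min_alt_reads: int = 3,
                      min_alt_fraction: float = 0.5) -> bool:
    """Return True if a site meets both read-support criteria.

    (1) more than 2 reads support the alternative allele (>= 3);
    (2) the alternative allele makes up at least 50% of the reads.
    Assumption: the denominator is ref_reads + alt_reads.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return False
    return alt_reads >= min_alt_reads and alt_reads / total >= min_alt_fraction


# Toy read counts: (reference reads, alternative reads).
sites = {"siteA": (5, 5), "siteB": (10, 2), "siteC": (7, 3), "siteD": (0, 3)}
kept = sorted(name for name, (r, a) in sites.items() if is_candidate_edit(r, a))
```

The 50% threshold removes partially edited sites with low editing levels, so a count obtained this way is conservative with respect to partial editing.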
Thirty-one representative mt protein-coding genes were extracted from 19 species, including 8 monocots, 6 eudicots, and one each from gymnosperm (Cycas taitungensis), vascular plant (Phlegmariurus squarrosus), liverwort (M. polymorpha), hornwort (Phaeoceros laevis), and moss (Physcomitrella patens). Their amino acid sequences were aligned by using clustalw2 (version 2.1). We used two methods, Maximum Likelihood (ML) with the Jones-Taylor-Thornton (JTT) substitution model and Maximum Parsimony (MP), in MEGA (version 6.06) to build phylogenies from the concatenated aligned sequences with 1000 bootstrap replicates. Sites with gaps or missing data were eliminated when the site coverage was below 90%. Phylogenetic trees were visualized with the EvolView program.
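The site-coverage rule can be illustrated as a column filter on an alignment. The sketch below is not MEGA's code; it assumes that coverage is the fraction of sequences with a residue (neither `-` nor `?`) at a column, and that columns at or above the 90% cutoff are retained. `partial_deletion` is a hypothetical name.

```python
from typing import Dict


def partial_deletion(alignment: Dict[str, str], min_coverage: float = 0.90,
                     missing: str = "-?") -> Dict[str, str]:
    """Drop alignment columns whose site coverage is below min_coverage.

    Site coverage = fraction of sequences with a residue (not a gap or
    missing-data symbol) at that column.
    """
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(s[i] not in missing for s in seqs) / len(seqs) >= min_coverage
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


# Toy alignment of 10 sequences and 3 columns.
toy = {f"s{i}": "MKL" for i in range(10)}
toy["s0"] = "M--"   # gaps in columns 2 and 3
toy["s1"] = "MK-"   # gap in column 3
filtered = partial_deletion(toy)
```

In the toy alignment, column 2 has 9 of 10 residues and is kept, while column 3 has 8 of 10 and is removed.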
We counted the number of reads for each gene in the mt genome using an in-house Perl script and identified differentially expressed genes using DESeq (version 1.20.0). To identify novel genes, we used Trinity (version 2.0.6) to construct transcripts based on the GSNAP mapping results. If different mt genes were assembled into one sequence, we assigned them to a polycistronic transcription unit.
Results and Discussion
The C. nucifera mt genome content
We started the C. nucifera (Oman local Tall variety) mt genome assembly based solely on the Roche/454 GS FLX data, including 7,617,799 single reads, 2,884,708 paired reads with 3-kb insert size, and 1,594,036 paired reads with 8-kb insert size. After homopolymer correction using the Illumina reads, we obtained an assembly of 678,653bp in length (Fig 1; see Materials and Methods). It encodes 72 proteins (87 protein-coding genes, 8.62% of mt genome), 9 truncated proteins (codon frameshift mutations; 10 pseudogenes, 0.83% of mt genome), 23 tRNAs (corresponding to 17 amino acid codons and one stop codon, 42 tRNA-coding genes, 0.46% of mt genome), and 3 ribosomal RNAs (6 rRNA-coding genes, 1.51% of mt genome), which all together constitute a gene content of 11.43% (77,542bp) (Table 1). Among them, 13 proteins (15 protein-coding genes), 2 truncated proteins (codon frameshift; 3 pseudogenes), 11 tRNAs (corresponding to 10 amino acid codons, 13 tRNA-coding genes) and 3 ribosomal RNAs (3 rRNA-coding genes) are located in the chloroplast-derived regions, which account for 5.07% of the genome sequence. The GC contents of protein-coding genes, pseudogenes, tRNAs, rRNAs, and the remaining non-coding sequences are 44.5% (58,895bp), 47.7% (5,294bp), 41.1% (3,092bp), 53.5% (10,261bp), and 45.5% (601,111bp), respectively. The genome harbors 0.49% tandem (3,310bp) and 17.26% long repeats (≥100bp). In addition, there are 13 co-transcribed gene clusters, including the 18S-5S rRNA and nad3-rps12 clusters that are conserved among angiosperm mt genomes. Our phylogenetic analysis shows that C. nucifera clusters with P. dactylifera and Butomus umbellatus among the monocotyledon plants (Fig 2).
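The totals quoted above can be cross-checked from the per-category lengths and GC contents given in the same paragraph. The sketch below uses only those numbers; the variable and function names are hypothetical. It checks three things: the genic categories sum to the stated 77,542bp, genic plus non-coding sequence equals the assembly length, and the length-weighted mean of the category GC contents is close to the genome-wide 45.5%.

```python
GENOME_BP = 678_653

# (length in bp, GC content in %) as quoted in the text.
FEATURES = {
    "protein-coding genes": (58_895, 44.5),
    "pseudogenes": (5_294, 47.7),
    "tRNA genes": (3_092, 41.1),
    "rRNA genes": (10_261, 53.5),
    "non-coding sequence": (601_111, 45.5),
}


def genic_bp(features: dict) -> int:
    """Total length of all annotated gene categories (non-coding excluded)."""
    return sum(bp for name, (bp, _) in features.items()
               if name != "non-coding sequence")


def weighted_gc(features: dict) -> float:
    """Length-weighted mean GC content (%) over all categories."""
    total = sum(bp for bp, _ in features.values())
    return sum(bp * gc for bp, gc in features.values()) / total


gene_total = genic_bp(FEATURES)
gene_percent = 100.0 * gene_total / GENOME_BP
overall_gc = weighted_gc(FEATURES)

print(f"genic: {gene_total} bp ({gene_percent:.2f}% of genome)")
print(f"weighted GC: {overall_gc:.1f}%")
```

Because the category GC values are rounded to one decimal place, the weighted mean is only expected to agree with the genome-wide figure to about that precision.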
We display (from outside to inside): physical map scaled in kb; coding sequences transcribed in the clockwise and counterclockwise directions (nad in red; cob, matR and mttB in green; cox in blue; atp in purple; ccm in orange; rpl in yellow; rps in dark red; rRNA in dark green; tRNA in dark blue; orf in dark purple; and others in black); chloroplast-derived regions (green); repeats (forward repeats in green, palindrome repeats in red and tandem repeats in blue); RNA edit sites (synonymous in green and non-synonymous in red); gene conserve scores (black); proper HiSeq mate-pair (MP) reads percent with insert size 5kb and 8kb (blue); and the four regions (thick lines indicate IRs and thin lines indicate LSC and SSC). * indicates pseudogenes.
Protein-coding, rRNA, and tRNA genes
The C. nucifera mt genome encodes 50 known functional and 22 hypothetical proteins. Among the first group, 23 proteins are related to the electron transport chain, including 9 subunits of nicotinamide adenine dinucleotide dehydrogenase (complex I), one subunit of succinate dehydrogenase (complex II), one apocytochrome b (complex III), 3 subunits of cytochrome c oxidase (complex IV), 5 subunits of ATP synthase F1 (complex V), and 4 cytochrome c biogenesis proteins (Table 1).
First, when we compared the C. nucifera mt proteins with those of 18 other plants (S1 Table and S1 Fig), we identified an sdh gene that is unique to the coconut and absent in 7 other monocots. Second, similar to the cases of Vitis vinifera, S. latifolia, and P. dactylifera, RNA polymerase genes are identified in the mt genome (one RNA polymerase and one DNA-dependent RNA polymerase). Third, the C. nucifera mt genome has the highest copy number of rps19 genes (5 copies) among all 19 inspected species, followed by V. vinifera (3 copies). Fourth, there is no rps3 gene in the C. nucifera mt genome, whereas it exists in 7 other monocot species. Fifth, rpl10 (pseudogene) and rps11 (protein-coding gene) are found only in P. dactylifera and C. nucifera among all 8 monocots. Last, a few cp-derived genes are identified in this genome, including 15 protein-coding (such as rpl14, rpl33 and rps14), 3 rRNA, and 13 tRNA genes as well as 3 pseudogenes.
The mt genome contains 42 tRNA genes; 12 of them have introns (9 mt tRNAs and 3 cp-derived tRNAs) (Table 2). Together, these tRNA genes correspond to 17 amino acids; tRNA genes for the remaining three (Ala, Leu, and Val) are absent. The tRNA genes for the amino acids Thr, His, Arg, and Gly, as well as tRNAIle(GAU), are found only in the cp-derived regions. These results are consistent with previous studies showing that mt tRNA genes are gradually replaced by cp-derived tRNA genes [24, 56].
Cp-derived regions, introns, and repeats
The plant cp and mt genomes are known to have extensive and widespread homologies due to sequence transfer [57, 58]. The transfer of cp genomic DNA to that of the mt genome has been going on for at least 300 million years . In the C. nucifera mt genome, there are 33 cp-derived regions with a length range of 64 to 3,365bp (S2 Table). The total length of cp-derived regions is 34,395bp and the coding region is 37.58% (12,925bp), which is higher than mt gene content (11.41%) but lower than cp gene content (61.17%). The GC content of the cp-derived regions is 41.9%, which is between those of the cp (37.44%) and mt (45.50%) genomes. A similar trend is found in P. dactylifera with GC contents of 37.23%, 37.40%, and 45.1% for cp, cp-derived region, and mt DNA, respectively [24, 60]. These results suggest that cp-derived sequences, to some extent, have evolved to be close to the mt genome sequences in GC contents and gene coding fractions after being transferred into mt genomes.
In the C. nucifera mt genome, there are 28 intron-containing genes (16 protein-coding genes and 12 tRNA genes), and according to the prediction based on Rfam, one group I intron (not located in gene regions) and 23 group II introns were identified. Among the 23 group II introns, 15 are located in 8 protein-coding genes (nad1, nad2, nad4, nad5, nad7, rps10, cox2a and cox2b) and 2 are in 2 tRNA genes (two trnI-GAU). Although mitochondrial tRNA genes do not possess introns in general, we identified 12 intron-containing tRNA genes (including 3 cp-derived tRNA genes) in the assembly. Among 18 other plants (S1 Table), M. polymorpha (liverwort), C. taitungensis (gymnosperm), B. umbellatus (monocot), P. dactylifera (monocot), Zea mays (monocot), and V. vinifera (eudicot) have one (tRNA-Ser), one (tRNA-Val), two (tRNA-Ile and tRNA-Ala), three (tRNA-Lys, tRNA-Asn and tRNA-Stop), three (tRNA-Leu/pseudo, tRNA-Leu/pseudo and tRNA-Ile), and one (tRNA-Lys) intron-containing tRNA genes, respectively. This shows that the C. nucifera mt genome has the largest number of intron-containing tRNA genes among all analyzed sequences.
The C. nucifera mt genome contains 0.49% tandem repeats, which is comparable to that of P. dactylifera (0.33%) (S3 Table). However, it harbors 17.26% long repeats (≥100bp), a fraction that is considerably higher than that of P. dactylifera (2.3%) but comparable to those of other monocot species, such as Triticum aestivum (15.9%), Sorghum bicolor (16.2%), and Zea mays (19.1%) (S4 Table).
Sequence variation analysis
Based on the HiSeq data, we identified 202 and 157 variations at different positions in the genome, using samtools & bcftools and RGAAT (https://sourceforge.net/projects/rgaat/), respectively; among the total, 102 variations were discovered by both methods (S5 Table). To reduce false positives, we only used the 102 shared variations (100 SNPs and 2 insertions) for further analysis. First, 48 out of the total are found in the cp-derived regions. Among all variations, only 5 SNPs are in the protein-coding genes, including 3 synonymous SNPs of rps1, rps2, and rpl14 (cp-derived) and two non-synonymous SNPs of orf247-ct (S6 Table). Another 6 SNPs and 1 insertion are found in 5 tRNA genes, whereas the remaining 89 SNPs and 1 insertion are non-coding. Second, according to the variation types, there are 23 transitions (Ti) and 77 transversions (Tv), leading to a Ti/Tv ratio of 0.30. If we remove the cp-derived regions from the analysis, the ratio becomes 0.06 (3 Ti and 50 Tv SNPs). This is in sharp contrast to the nuclear genome, where the ratios are ~2.0–2.1 genome-wide and 3.0–3.3 in exonic sequences [61, 62]. The Ti/Tv ratio in the coconut mt genome is thus much lower than that in the nuclear genome, as well as the random expectation (0.5). It supports the observation that DNA replication and repair mechanisms are very different between mt and nuclear genomes. Third, we further scrutinized the data to exclude other possibilities that may lead to biased results. According to the Roche/454 and Illumina sequence coverage, there are ~2x, ~42x, and ~235x of the Roche/454 reads, as well as ~20x, ~1788x, and ~6000x of the Illumina reads for nuclear, mt, and cp DNA, respectively, which reflect a copy number ratio among them of ~1:55:209 on average. 
This result indicates that only 1.79% of the reads with similar sequences in the mt genome datasets may originate from the nuclear genome, and these can be excluded readily during sequence variation identification (alternative allele proportion ≥ 15%). Similarly, for the cp-derived regions, sequence variations are more likely from cp (79.17%) rather than from the nuclear or mt genomes.
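The Ti/Tv and copy-number figures in this section can be reproduced with a few lines of arithmetic. The sketch below is one reading that matches the quoted numbers, not a description of the authors' scripts. It assumes that the ~1:55:209 ratio is the mean of the per-platform depth ratios relative to nuclear DNA, that 1.79% is the nuclear share of nuclear plus mt copies, and that 79.17% is the cp share of mt plus cp copies. All function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    return pair <= PURINES or pair <= PYRIMIDINES


def count_ti_tv(snps: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Count transitions and transversions in (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti, tv


def mean_copy_ratio(depths: Dict[str, Tuple[float, float, float]]) -> Tuple[float, float]:
    """Average mt:nuclear and cp:nuclear depth ratios across platforms.

    depths maps platform -> (nuclear, mt, cp) sequencing depth.
    """
    mt = [m / n for n, m, _ in depths.values()]
    cp = [c / n for n, _, c in depths.values()]
    return sum(mt) / len(mt), sum(cp) / len(cp)


# SNP counts quoted in the text.
all_ratio = 23 / 77      # all shared SNPs
non_cp_ratio = 3 / 50    # cp-derived regions removed

# Approximate depths quoted in the text.
DEPTHS = {"Roche/454": (2, 42, 235), "Illumina": (20, 1788, 6000)}
mt_ratio, cp_ratio = mean_copy_ratio(DEPTHS)
mt_copies, cp_copies = round(mt_ratio), round(cp_ratio)

nuclear_share = 100.0 * 1 / (1 + mt_copies)              # nuclear vs nuclear + mt
cp_share = 100.0 * cp_copies / (mt_copies + cp_copies)   # cp vs mt + cp
```

Under these assumptions, a nuclear-derived allele would be expected at well below the 15% alternative-allele cutoff, which is why such reads are filtered out during variant calling.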
When we compared the C. nucifera mt genome with those of the two taxonomically closest species in this study, P. dactylifera and B. umbellatus, only 54.45% and 14.15% of the C. nucifera mt genome could be aligned, respectively, using bl2seq (S2 Fig). To further evaluate mt genome variations between the two palm species P. dactylifera and C. nucifera, we used MUMmer to compare the alignable regions and identified 2,442 SNPs and 1,122 indels, giving an average rate of 5 variations per 1,000bp (S3 Fig).
RNA editing is universal to almost all plant mt transcripts [64, 65] and is characterized by tissue-specific and partial editing. Different species have different RNA editing sites, and the number of RNA editing sites ranges from 200 to 600 in angiosperms. The public RNA-Seq data in NCBI are excellent and untapped resources, where we found 8 RNA-Seq datasets from coconut (two of Tall cultivars and 6 of Dwarf cultivars) for our RNA editing analysis. To differentiate sequencing errors and SNPs from editing, we only kept the RNA editing sites with more than 2 supporting reads and with at least 50% edited reads. These criteria led to the identification of 845 RNA editing sites in 56 protein-coding genes, and 36 RNA editing sites are in the cp-derived regions (S7 Table). Among the total RNA editing sites (92 synonymous and 753 nonsynonymous), there are 811 C->T, 26 G->A and 8 T->C sites. We compared tissue disparity among the 8 samples, where healthy leaf1 has the most RNA editing sites (697, 82.49%, 18 unique) and embryo has the fewest (489, 57.87%, 22 unique). In addition, 297 RNA editing sites are shared by all 8 samples. Since the 8 samples are from two cultivars, we partitioned the editing sites between the Dwarf and Tall cultivars, yielding 835 and 675 RNA editing sites, respectively, of which 665 are shared. Considering the codon-changing edits, we ranked the top six codon changes: TCA->TTA (95, 11.24%), TCT->TTT (67, 7.93%), TCG->TTG (58, 6.86%), CCA->CTA (50, 5.92%), TCC->TTC (45, 5.33%), and CGG->TGG (45, 5.33%); 5 of them changed the second codon position. Moreover, the top six edited codons are TTT (135, 15.98%), TTA (118, 13.96%), TTG (72, 8.52%), TTC (58, 6.86%), CTA (51, 6.04%), and CTT (50, 5.92%).
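The codon-change summary can be checked with a short tally. The sketch below uses only the six codon changes and the total of 845 sites quoted above; it locates the edited codon position for each change and recomputes the percentages. The names `changed_positions` and `TOP_CHANGES` are hypothetical.

```python
from typing import List

TOTAL_SITES = 845

# (codon before editing, codon after editing) -> number of sites, from the text.
TOP_CHANGES = {
    ("TCA", "TTA"): 95,
    ("TCT", "TTT"): 67,
    ("TCG", "TTG"): 58,
    ("CCA", "CTA"): 50,
    ("TCC", "TTC"): 45,
    ("CGG", "TGG"): 45,
}


def changed_positions(before: str, after: str) -> List[int]:
    """Return the 1-based codon positions that differ between two codons."""
    if len(before) != 3 or len(after) != 3:
        raise ValueError("codons must be three bases long")
    return [i + 1 for i, (a, b) in enumerate(zip(before, after)) if a != b]


second_position = [pair for pair in TOP_CHANGES
                   if changed_positions(*pair) == [2]]
percentages = {pair: round(100.0 * n / TOTAL_SITES, 2)
               for pair, n in TOP_CHANGES.items()}

for (before, after), n in TOP_CHANGES.items():
    print(f"{before}->{after}: {n} sites, position {changed_positions(before, after)}")
```

All six changes are C-to-T substitutions at a single codon position; only CGG->TGG falls on the first position.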
We also predicted 648 RNA editing sites in 45 genes using the PREP-mt program. Comparing the RNA editing sites identified by the two methods, we have 591 shared, 57 unique to the PREP-mt program, and 212 unique to our method; the underestimation by the PREP-mt program becomes obvious.
Gene expression analysis based on transcriptome data
Using the RNA-Seq datasets, we obtained mt transcriptomic profiles for the 8 samples (Fig 3 and Table 3). Three healthy leaf samples have the most abundant mapped reads (3.71% to 1.47%), two disease related leaf samples and embryogenic callus fall into the second abundance group (0.29% to 0.24%), whereas endosperm and embryo are the least abundant (0.12% and 0.05%, respectively). Read abundance of mt sequence coincides with tissue characteristics but read coverage shows a different pattern. First, root wilt disease susceptible (RWDS) leaf has the highest read coverage (71.92%) and coconut yellow decline (CYD) leaf has the lowest read coverage (34.94%). Second, healthy leaf samples (54.77% to 68.00%) and embryogenic callus (57.52%) have higher read coverage as opposed to embryo (37.34%) and endosperm (45.28%) (Table 4).
We display (from outside to inside): physical map scaled in kb; coding sequences transcribed in the clockwise and counterclockwise directions (nad in red; cob, matR and mttB in green; cox in blue; atp in purple; ccm in orange; rpl in yellow; rps in dark red; rRNA in dark green; tRNA in dark blue; orf in dark purple; and others in black); histogram of transcriptome data (plus strand in red and minus strand in green, standing for normalized average coverage value per 100 bp ranging from 0 to 100) for sample Health_leaf1, CYD_leaf, Callus, RWDS_leaf, Endosperm, Embryo, Health_leaf2 and Leaf_fruit; coding sequences transcribed in the clockwise and counterclockwise directions; and the four regions (thick lines indicate IRs and thin lines indicate LSC and SSC). * indicates pseudogene.
Of the total 145 genes, 113 are expressed in at least two samples, whereas only 3 genes (rpo, trna-UUA, and trnI-AAU) are expressed in one sample (Young_leaf) (S8 Table). The number of expressed genes is consistent with read coverage. CYD leaf has the fewest expressed genes (92), as opposed to RWDS leaf, which has the most (116). The genes petL and orf247-ct, which have a stop codon in the middle of the gene sequence, are highly expressed; however, we have not found any RNA editing sites that would rescue the normal protein-coding function. Both of them need to be validated in future studies. All pseudogenes other than rpl10 have relatively high expression levels in all samples. According to the transcriptomic profiles, we found 13 polycistronic transcripts among 37 genes (S9 Table). The conserved co-transcribed gene clusters rps12-nad3 and 5SrRNA-18SrRNA are also found in our mt genome.
According to the gene expression profiles (Fig 4), we have observed several obvious features. First, the genes can be divided into three categories according to expression intensity: highly, moderately, and lowly expressed. Second, among 33 highly expressed genes, there are only two tRNA genes (trnI-GAU and trnH-GUG). Third, three of the five rps19 copies are highly expressed and the rest are moderately expressed. Fourth, only nad6 and ccmFn1 of the 25 respiration related genes are lowly expressed.
Our phylogenetic trees are built based on 31 mt protein-coding genes from 19 selected plants (8 monocots and 6 eudicots, as well as one each of gymnosperm, vascular plant, liverwort, hornwort, and moss; Fig 2). The maximum-likelihood (ML) tree has higher bootstrap values than the maximum parsimony (MP) tree except for the node of S. bicolor and Z. mays and the node between P. squarrosus and the group of M. polymorpha and P. patens. In the ML tree, most nodes have bootstrap values larger than 65%, except for one node (49%) among Helianthus annuus, V. vinifera and Carica papaya and another node (58%) between P. squarrosus and the group of M. polymorpha and P. patens. Both methods have high bootstrap values (97% and 85%) for the subgroup of C. nucifera, P. dactylifera and B. umbellatus in monocots. Previous studies indicate that date palm appears to be the most basal among monocots [24, 69]. Moreover, date palm has certain miRNA families only found in eudicots. Taken together, these results suggest that Arecaceae separated from the monocotyledon clade earlier than other plant families.
Despite the fact that the C. nucifera mt genome is as large as 678,653bp in length, we have assembled it using a variety of datasets and information, including all plant mt genome sequences, C. nucifera mt sequence datasets from different platforms and libraries with variable insert sizes, and specialized bioinformatics tools suitable for different purposes. The genome sequence variations and RNA editing sites based on transcriptomic data are all invaluable for further biological studies. Phylogenetic analysis indicates that Arecaceae separated from the rest of monocotyledons earlier in flowering plant evolution.
S1 Fig. The homologous mt genes among C. nucifera and 18 other representative plant species.
S2 Fig. A mt genome comparison among C. nucifera, P. dactylifera and B. umbellatus.
S3 Fig. Palm mt and cp genome comparisons between P. dactylifera and C. nucifera (Ref) based on MUMmer.
(A) mt genomes and (B) cp genomes. Unlike the cp genomes, variations between the mt genomes are much higher.
S1 Table. The homologous mt genes among C. nucifera and 18 other representative species.
S2 Table. The cp-derived regions in the C. nucifera mt genome.
S3 Table. Tandem repeats identified in the C. nucifera mt genome by using Tandem Repeats Finder.
S4 Table. Long repeats (> = 100bp, forward and palindrome) identified in the C. nucifera mt genome based on REPuter.
S5 Table. The common variations identified in the C. nucifera by using samtools & bcftools and RGAAT.
S6 Table. The functional evaluation of common variations in the C. nucifera mt genome.
S7 Table. RNA editing sites identified by using RNA-Seq data and PREP-mt program.
NT, nucleotide; AA, amino acid.
S8 Table. Reads of mt genes in the 8 coconut RNA-Seq datasets.
We thank Drs. Xiaowei Zhang and Yuxin Yin for collecting the coconut samples in the early stage of this project. We acknowledge the public RNA-Seq data produced by other research groups. We also acknowledge all members of the Joint Center for Genomics Research (JCGR) for their help in completing this project. Technical support was provided by the CAS Key Laboratory of Genome Science and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, the People's Republic of China. The authors thank the anonymous reviewers and editors for their critical comments and helpful suggestions.
- Conceptualization: WFL QL SNH JY HAA.
- Data curation: WFL QL.
- Formal analysis: WFL QL YHZ JYZ AA IOA AOA.
- Funding acquisition: HAA WFL QL.
- Investigation: WFL QL HAA.
- Methodology: WFL QL.
- Project administration: HAA SNH JY.
- Resources: WFL QL HAA.
- Software: WFL QL.
- Supervision: WFL QL HAA SNH JY.
- Validation: WFL QL AMA AA JY.
- Visualization: WFL QL.
- Writing – original draft: WFL QL HAA SNH JY.
- Writing – review & editing: WFL QL HAA SNH JY.
- 1. Lang BF, Gray MW, Burger G. Mitochondrial genome evolution and the origin of eukaryotes. Annual review of genetics. 1999;33(1):351–97. pmid:10690412
- 2. McBride HM, Neuspiel M, Wasiak S. Mitochondria: more than just a powerhouse. Current Biology. 2006;16(14):R551–R60. pmid:16860735
- 3. Oda K, Yamato K, Ohta E, Nakamura Y, Takemura M, Nozato N, et al. Gene organization deduced from the complete sequence of liverwort Marchantia polymorpha mitochondrial DNA: a primitive form of plant mitochondrial genome. Journal of molecular biology. 1992;223(1):1–7. pmid:1731062
- 4. NCBI RC. Database resources of the National Center for Biotechnology Information. Nucleic acids research. 2013;41(Database issue):D8. pmid:23193264
- 5. Adams KL, Palmer JD. Evolution of mitochondrial gene content: gene loss and transfer to the nucleus. Molecular phylogenetics and evolution. 2003;29(3):380–95. pmid:14615181
- 6. Cummings MP, Nugent JM, Olmstead RG, Palmer JD. Phylogenetic analysis reveals five independent transfers of the chloroplast gene rbcL to the mitochondrial genome in angiosperms. Current genetics. 2003;43(2):131–8. pmid:12695853
- 7. Turmel M, Otis C, Lemieux C. The mitochondrial genome of Chara vulgaris: insights into the mitochondrial DNA architecture of the last common ancestor of green algae and land plants. The Plant Cell. 2003;15(8):1888–903. pmid:12897260
- 8. Sloan DB, Wu Z. History of plastid DNA insertions reveals weak deletion and AT mutation biases in angiosperm mitochondrial genomes. Genome biology and evolution. 2014;6(12):3210–21. pmid:25416619
- 9. Vaughn JC, Mason MT, Sper-Whitis GL, Kuhlman P, Palmer JD. Fungal origin by horizontal transfer of a plant mitochondrial group I intron in the chimeric coxI gene of Peperomia. Journal of Molecular Evolution. 1995;41(5):563–72. pmid:7490770
- 10. Mower JP, Sloan DB, Alverson AJ. Plant mitochondrial genome diversity: the genomics revolution: Springer; 2012. https://doi.org/10.1007/978-3-7091-1130-7_9
- 11. Adams KL, Qiu Y-L, Stoutemyer M, Palmer JD. Punctuated evolution of mitochondrial gene content: high and variable rates of mitochondrial gene loss and transfer to the nucleus during angiosperm evolution. Proceedings of the National Academy of Sciences. 2002;99(15):9905–12. pmid:12119382
- 12. Palmer JD, Herbon LA. Plant mitochondrial DNA evolved rapidly in structure, but slowly in sequence. Journal of Molecular Evolution. 1988;28(1–2):87–97. pmid:3148746
- 13. Gray MW, Burger G, Lang BF. Mitochondrial evolution. Science. 1999;283(5407):1476–81. pmid:10066161
- 14. Lynch M, Koskella B, Schaack S. Mutation pressure and the evolution of organelle genomic architecture. Science. 2006;311(5768):1727–30. pmid:16556832
- 15. Wolfe KH, Li W-H, Sharp PM. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proceedings of the National Academy of Sciences. 1987;84(24):9054–8. pmid:3480529
- 16. Kubo T, Newton KJ. Angiosperm mitochondrial genomes and mutations. Mitochondrion. 2008;8(1):5–14. pmid:18065297
- 17. Wu Z, Cuthbert JM, Taylor DR, Sloan DB. The massive mitochondrial genome of the angiosperm Silene noctiflora is evolving by gain or loss of entire chromosomes. Proceedings of the National Academy of Sciences. 2015;112(33):10185–91. pmid:25944937
- 18. Winkler M, Kück U. The group IIB intron from the green alga Scenedesmus obliquus mitochondrion: molecular characterization of the in vitro splicing products. Current genetics. 1991;20(6):495–502. pmid:1723663
- 19. Gualberto JM, Lamattina L, Bonnard G, Weil J-H, Grienenberger J-M. RNA editing in wheat mitochondria results in the conservation of protein sequences. 1989. pmid:2552325
- 20. Hiesel R, Combettes B, Brennicke A. Evidence for RNA editing in mitochondria of all major groups of land plants except the Bryophyta. Proceedings of the National Academy of Sciences. 1994;91(2):629–33. pmid:8290575
- 21. Malek O, Lättig K, Hiesel R, Brennicke A, Knoop V. RNA editing in bryophytes and a molecular phylogeny of land plants. The EMBO journal. 1996;15(6):1403. pmid:8635473
- 22. Freyer R, Kiefer-Meyer M-C, Kössel H. Occurrence of plastid RNA editing in all major lineages of land plants. Proceedings of the National Academy of Sciences. 1997;94(12):6285–90. pmid:9177209
- 23. Wu Z, Stone JD, Štorchová H, Sloan DB. High transcript abundance, RNA editing, and small RNAs in intergenic regions within the massive mitochondrial genome of the angiosperm Silene noctiflora. BMC genomics. 2015;16(1):938. pmid:26573088
- 24. Fang Y, Wu H, Zhang T, Yang M, Yin Y, Pan L, et al. A complete sequence and transcriptomic analyses of date palm (Phoenix dactylifera L.) mitochondrial genome. PloS one. 2012;7(5):e37164. pmid:22655034
- 25. Schnell RJ, Priyadarshan P. Genomics of tree crops: Springer Science & Business Media; 2012. https://doi.org/10.1007/978-1-4614-0920-5
- 26. Gunn BF, Baudouin L, Beulé T, Ilbert P, Duperray C, Crisp M, et al. Ploidy and domestication are associated with genome size variation in Palms. American journal of botany. 2015;102(10):1625–33. pmid:26437888
- 27. Alsaihati B. Coconut genome de novo sequencing. Plant and Animal Genome XXII Conference; 2014: Plant and Animal Genome.
- 28. Fan H, Xiao Y, Yang Y, Xia W, Mason AS, Xia Z, et al. RNA-Seq analysis of Cocos nucifera: transcriptome sequencing and de novo assembly for subsequent functional genomics approaches. PloS one. 2013;8(3):e59997. pmid:23555859
- 29. Huang Y-Y, Lee C-P, Fu JL, Chang BC-H, Matzke AJ, Matzke M. De Novo Transcriptome Sequence Assembly from Coconut Leaves and Seeds with a Focus on Factors Involved in RNA-Directed DNA Methylation. G3: Genes| Genomes| Genetics. 2014;4(11):2147–57. pmid:25193496
- 30. Nejat N, Cahill DM, Vadamalai G, Ziemann M, Rookes J, Naderali N. Transcriptomics-based analysis using RNA-Seq of the coconut (Cocos nucifera) leaf in response to yellow decline phytoplasma infection. Molecular Genetics and Genomics. 2015:1–12. pmid:25893418
- 31. Gawel N, Jarret R. A modified CTAB DNA extraction procedure forMusa andIpomoea. Plant Molecular Biology Reporter. 1991;9(3):262–6.
- 32. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC bioinformatics. 2009;10(1):421. pmid:20003500
- 33. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997;25(17):3389–402. pmid:9254694
- 34. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215(3):403–10. pmid:2231712
- 35. Iorizzo M, Senalik D, Szklarczyk M, Grzebelus D, Spooner D, Simon P. De novo assembly of the carrot mitochondrial genome using next generation sequencing of whole genomic DNA provides first evidence of DNA transfer into an angiosperm plastid genome. BMC plant biology. 2012;12(1):61. pmid:22548759
- 36. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature methods. 2012;9(4):357–9. pmid:22388286
- 37. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. pmid:19505943
- 38. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987–93. pmid:21903627
- 39. Li H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011;27(8):1157–8. pmid:21320865
- 40. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nature biotechnology. 2011;29(1):24–6. pmid:21221095
- 41. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics. 2012:bbs017. pmid:22517427
- 42. Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, et al. Rfam 12.0: updates to the RNA families database. Nucleic acids research. 2014:gku1063. pmid:25392425
- 43. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research. 1997;25(5):0955–964. pmid:9023104
- 44. Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic acids research. 2001;29(22):4633–42. pmid:11713313
- 45. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research. 1999;27(2):573. pmid:9862982
- 46. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome biology. 2004;5(2):R12. pmid:14759262
- 47. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014:btu170. pmid:24695404
- 48. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26(7):873–81. pmid:20147302
- 49. Mower JP. The PREP suite: predictive RNA editors for plant mitochondrial genes, chloroplast genes and user-defined alignments. Nucleic acids research. 2009;37(suppl 2):W253–W9. pmid:19433507
- 50. Larkin MA, Blackshields G, Brown N, Chenna R, McGettigan PA, McWilliam H, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23(21):2947–8. pmid:17846036
- 51. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. MEGA6: molecular evolutionary genetics analysis version 6.0. Molecular biology and evolution. 2013;30(12):2725–9. pmid:24132122
- 52. Zhang H, Gao S, Lercher MJ, Hu S, Chen W-H. EvolView, an online tool for visualizing, annotating and managing phylogenetic trees. Nucleic acids research. 2012;40(W1):W569–W72. pmid:22695796
- 53. Anders S, Huber W. Differential expression analysis for sequence count data. Genome biol. 2010;11(10):R106. pmid:20979621
- 54. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology. 2011;29(7):644.
- 55. Binder S, Marchfelder A, Brennicke A. Regulation of gene expression in plant mitochondria. Post-Transcriptional Control of Gene Expression in Plants: Springer; 1996. p. 303–14. pmid:8980484
- 56. Tian X, Zheng J, Hu S, Yu J. The discriminatory transfer routes of tRNA genes among organellar and nuclear genomes in flowering plants: a genome-wide investigation of indica rice. Journal of molecular evolution. 2007;64(3):299–307. pmid:17273918
- 57. Stern DB, Lonsdale DM. Mitochondrial and chloroplast genomes of maize have a 12-kilobase DNA sequence in common. Nature. 1982;299(5885):698–702. pmid:6889685
- 58. Stern DB, Palmer JD. Extensive and widespread homologies between mitochondrial DNA and chloroplast DNA in plants. Proceedings of the National Academy of Sciences. 1984;81(7):1946–50. pmid:16593442
- 59. Wang D, Wu Y-W, Shih AC-C, Wu C-S, Wang Y-N, Chaw S-M. Transfer of chloroplast genomic DNA to mitochondrial genome occurred at least 300 MYA. Molecular biology and evolution. 2007;24(9):2040–8. pmid:17609537
- 60. Yang M, Zhang X, Liu G, Yin Y, Chen K, Yun Q, et al. The complete chloroplast genome sequence of date palm (Phoenix dactylifera L.). PloS one. 2010;5(9):e12762. pmid:20856810
- 61. Ebersberger I, Metzler D, Schwarz C, Pääbo S. Genomewide comparison of DNA sequences between humans and chimpanzees. The American Journal of Human Genetics. 2002;70(6):1490–7. pmid:11992255
- 62. Freudenberg-Hua Y, Freudenberg J, Kluck N, Cichon S, Propping P, Nöthen MM. Single nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative sample of the European population. Genome research. 2003;13(10):2271–6. pmid:14525928
- 63. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. Journal of Computational biology. 2000;7(1–2):203–14. pmid:10890397
- 64. Covello PS, Gray MW. RNA editing in plant mitochondria. 1989.
- 65. Hiesel R, Wissinger B, Schuster W, Brennicke A. RNA editing in plant mitochondria. Science. 1989;246(4937):1632–4. pmid:2480644
- 66. Picardi E, Horner DS, Chiara M, Schiavon R, Valle G, Pesole G. Large-scale detection and analysis of RNA editing in grape mtDNA by RNA deep-sequencing. Nucleic acids research. 2010;38(14):4755–67. pmid:20385587
- 67. Cuenca A, Petersen G, Seberg O. The complete sequence of the mitochondrial genome of Butomus umbellatus–a member of an early branching lineage of Monocotyledons. 2013. pmid:23637852
- 68. Smith DR. RNA-Seq data: a goldmine for organelle research. Briefings in functional genomics. 2013;12(5):454–6. pmid:23334532
- 69. Al-Mssallem IS, Hu S, Zhang X, Lin Q, Liu W, Tan J, et al. Genome sequence of the date palm Phoenix dactylifera L. Nature communications. 2013;4. pmid:23917264
- 70. Xin C, Liu W, Lin Q, Zhang X, Cui P, Li F, et al. Profiling microRNA expression during multi-staged date palm (Phoenix dactylifera L.) fruit development. Genomics. 2015;105(4):242–51. pmid:25638647