Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome

Coconut (Cocos nucifera L.), a member of the palm family (Arecaceae), is one of the most economically important crops in tropics, serving as an important source of food, drink, fuel, medicine, and construction material. Here we report an assembly of the coconut (C. nucifera, Oman local Tall cultivar) mitochondrial (mt) genome based on next-generation sequencing data. This genome, 678,653bp in length and 45.5% in GC content, encodes 72 proteins, 9 pseudogenes, 23 tRNAs, and 3 ribosomal RNAs. Within the assembly, we find that the chloroplast (cp) derived regions account for 5.07% of the total assembly length, including 13 proteins, 2 pseudogenes, and 11 tRNAs. The mt genome has a relatively large fraction of repeat content (17.26%), including both forward (tandem) and inverted (palindromic) repeats. Sequence variation analysis shows that the Ti/Tv ratio of the mt genome is lower as compared to that of the nuclear genome and neutral expectation. By combining public RNA-Seq data for coconut, we identify 734 RNA editing sites supported by at least two datasets. In summary, our data provides the second complete mt genome sequence in the family Arecaceae, essential for further investigations on mitochondrial biology of seed plants.


Introduction
The plant mitochondrial (mt) genome is considered as a remnant of an ancestral α-proteobacterium that was symbiont in its eukaryotic common ancestor [1]. It is involved in cellular energy production by respiration and various cellular function regulations, such as homeostasis, apoptosis, and metabolite biosynthesis [2]. Since the first mt genome of land plants was published (Marchantia polymorpha, liverwort) [3], there had been 303 mt genomes available until December 9, 2015 in the NCBI organelle database [4]. Plant mt genomes have several characteristics that make them important for evolutionary studies. First, plant mt gene contents used for library construction for both single-read and paired-read libraries with 3kb and 8kb insert sizes according to the manufacturer's manual for GS FLX Titanium. The libraries were amplified and sequenced on the Roche/454 GS FLX platform. All Roche/454 data was deposited at BIGD database (http://gsa.big.ac.cn, CRX007340 and CRX007339). The same purified DNA sample was also used for constructing the Illumina HiSeq libraries. HiSeq paired-end (< = 500bp) and mate-pair libraries (1kb to 8kb) were constructed using the Illumina Simple Paired-End Library and Mate-Pair Library Preparation Protocol, respectively. The libraries were sequenced by Illumina HiSeq 2000 platform. HiSeq data used for coconut mt genome correction was deposited at BIGD database (CRX007360, CRX007361 and CRX007362).

Sequence assembly and validation
We first assembled total reads from 13 single-read datasets and 12 paired-read datasets into 573,893 contigs using Newbler 2.6 (with "-a 0" option and default for others), a de novo sequence assembly software. We then aligned the assembled contigs to 234 published land plant mt genomes downloaded from NCBI organelle database at September 16, 2015 using BLAST (identity> = 80%, E-value< = 10 −5 and overlap percent> = 30%) [32][33][34]. We next used 353 annotated contigs (length ranging from 102bp to 49,695bp with median size in 399bp) to build scaffolds using bb.454contignet and manually checked based on contig coverage and spanning reads in Newbler assemblies [35]. We finally obtained a single scaffold of 678,112bp in length without gaps from 143 overlapping contigs.
To correct the sequence errors that are unique to the Roche/454 platform in the assembly, such as homopolymers (characteristic of the pyrosequencing), we used HiSeq paired-end data (180bp insert size) and bowtie2 (version 2.2.4) [36]. The consensus sequence was obtained by using samtools (version 1.2) [37,38] and bcftools (version 1.2) [39]. The length became 678,133bp after this correction. As a byproduct, we identified several pseudogenes due to frame-shifts caused by homopolymers. We checked the final assembly manually based on Roche/454 and HiSeq paired-end data using IGV software (version 2.3.61) and revised 687 loci with 528 indels and 159 SNPs [40,41]. Finally, we obtained a new length of 678,653bp with average sequence depths of~42x for Roche/454 data and~1788x for HiSeq data. We checked this assembly using HiSeq mate-pair data with insert sizes of 5kb and 8kb in a 5kb and 8kb sliding windows, respectively. On average, our final genome assembly was supported by 59.57% and 58.37% mate-pair reads from the 5-kb and 8-kb libraries. The complete mt genome sequence was deposited to GenBank (accession number KX028885).

Sequence variants
Sequence variants were identified based on HiSeq paired-end data with 180-bp insert size. The raw reads were mapped to the final mt genome by using bowtie2 (version 2.2.4) [36], and the variants were called by using RGAAT tool, which developed in our laboratory (https:// sourceforge.net/projects/rgaat/), and samtools and bcftools (version 1.2) [37][38][39]. To eliminate false positives, we only kept the variations identified by both methods. To evaluate the variations between the two palm species, C. nucifera and P. dactylifera, MUMmer3 was used for genome alignment [46].

Transcriptome analysis
We counted the number of reads for each gene for mt genome using an in-house Perl script and identified differentially expressed genes using DESeq (version 1.20.0) [53]. For identifying the novel genes, we used Trinity (version 2.0.6) to construct transcripts based on GSNAP mapping results [54]. If different mt genes were assembled into one sequence, we assigned them to polycistronic transcription unit.

Results and Discussion
The C. nucifera mt genome content We started C. nucifera (Oman local Tall variety) mt genome assembly based solely on the Roche/454 GS FLX data, including 7,617,799 single reads, 2,884,708 paired reads with 3-kb insert size, and 1,594,036 paired reads with 8-kb insert size. After homopolymer correction using the Illumina reads, we have an assembly of 678,653bp in length (Fig 1; see Materials and Methods). It encodes 72 proteins (87 protein-coding genes, 8.62% of mt genome), 9 truncated proteins (codon frameshift mutations; 10 pseudogenes, 0.83% of mt genome), 23 tRNAs (corresponding to 17 amino acid codons and one stop codon, 42 tRNA-coding genes, 0.46% of mt genome), and 3 ribosomal RNAs (6 rRNA-coding genes, 1.51% of mt genome), We display (from outside to inside): physical map scaled in kb; coding sequences transcribed in the clockwise and counterclockwise directions (nad in red; cob, matR and mttB in green; cox in blue; atp in purple; ccm in orange; rpl in yellow; rps in dark red; rRNA in dark green; tRNA in dark blue; orf in dark purple; and others in black); chloroplast-derived regions (green); repeats (forward repeats in green, palindrome repeats in red and tandem repeats in blue); RNA edit sites (synonymous in green and non-synonymous in red); gene conserve scores (black); proper HiSeq mate-pair (MP) reads percent with insert size 5kb and 8kb (blue); and the four regions (thick lines indicate IRs and thin lines indicate LSC and SSC). * indicates pseudogenes. doi:10.1371/journal.pone.0163990.g001 Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome which all together constitute a gene content of 11.43% (77,542bp) ( Table 1). Among them, 13 proteins (15 protein-coding genes), 2 truncated proteins (codon frameshift; 3 pseudogenes), 11 tRNAs (corresponding to 10 amino acid codons, 13 tRNA-coding genes) and 3 ribosomal RNAs (3 rRNA-coding genes) locate in the chloroplast-derived regions, which are accounted for 5.07% of the genome sequence. The GC contents of protein-coding genes, pseudogenes, tRNAs, rRNAs, and the remaining non-coding sequences are 44.5% (58,895bp), 47.7% (5,294bp), 41.1% (3,092bp), 53.5% (10,261bp), and 45.5% (601,111bp), respectively. The genome harbors 0.49% tandem (3,310bp) and 17.26% long repeats (!100bp). In addition, there are 13 co-transcribed gene clusters, including conserved 18S-5S rRNA and nad3-rps12 among angiosperm mt genomes [55]. Our phylogenetic analysis shows that C. nucifera clusters with P. dactylifera and Butomus umbellatus among the monocotyledon plants (Fig 2). Protein-coding, rRNA, and tRNA genes The C. nucifera mt genome encodes 50 known functional and 22 hypothetical proteins. Among the first group, 23 proteins are related to the electron transport chain, including 9 subunits of nicotinamide adenine dinucleotide dehydrogenase (complex I), one subunit of succinate dehydrogenase (complex II), one apocytochrome b (complex III), 3 subunits of cytochrome c oxidase (complex IV), 5 subunits of ATP synthase F1 (complex V), and 4 cytochrome c biogenesis proteins (Table 1). First, when compared the C. nucifera mt proteins to 18 other plants (S1 Table and S1 Fig), we identified sdh gene that is unique to the coconut and absent in 7 other monocots. Second, similar in the cases of Vitis vinifera, S. latifolia, and P. dactylifera, RNA polymerase genes are identified in the mt genome (one RNA polymerase and one DNA-dependent RNA polymerase). Third, the C. nucifera mt genome has the highest copy number of rps19 genes (5 copies) in all 19 inspected species, followed by V. vinifera (3 copies). Fourth, there is no rps3 gene in C. nucifera mt genome, whereas it exists in 7 other monocot species. Fifth, rpl10 (pseudogene) and rps11 (protein-coding gene) are found only in P. dactylifera and C. nucifera among all 8 monocots. Last, a few of cp-derived genes are identified in this genome, including 15 proteincoding (such as rpl14, rpl33 and rps14), 3 rRNA, and 13 tRNA genes as well as 3 pseudogenes.
The mt genome contains 42 tRNA genes; 12 of them have introns (9 mt tRNAs and 3 cpderived tRNAs) ( Table 2). Among these tRNA genes, all correspond to 17 amino acids but are absent for the rest three: Ala, Leu, and Val. The tRNA genes for amino acids Thr, His, Arg, Gly, and tRNA Ile (GAU) are only found in the cp-derived regions. These results are consistent with previous studies that the mt tRNA genes are replaced by those of the cp-derived tRNA gradually [24,56].

Cp-derived regions, introns, and repeats
The plant cp and mt genomes are known to have extensive and widespread homologies due to sequence transfer [57,58]. The transfer of cp genomic DNA to that of the mt genome has been going on for at least 300 million years [59]. In the C. nucifera mt genome, there are 33 cpderived regions with a length range of 64 to 3,365bp (S2 Table). The total length of cp-derived regions is 34,395bp and the coding region is 37.58% (12,925bp), which is higher than mt gene content (11.41%) but lower than cp gene content (61.17%). The GC content of the cp-derived regions is 41.9%, which is between those of the cp (37.44%) and mt (45.50%) genomes. A similar trend is found in P. dactylifera with GC contents of 37.23%, 37.40%, and 45.1% for cp, cpderived region, and mt DNA, respectively [24,60]. These results suggest that cp-derived sequences, to some extent, have evolved to be close to the mt genome sequences in GC contents and gene coding fractions after being transferred into mt genomes.

Sequence variation analysis
Based on the HiSeq data, we identified 202 and 157 variations in different places of the genome, using samtools & bcftools and RGAAT (https://sourceforge.net/projects/rgaat/), respectively; among the total, 102 variations are cross-discovered based on both methods (S5 Table). To reduce false positives, we only used the 102 shared variations (100 SNPs and 2 insertions) for further analysis. First, 48 out of the total are found in the cp-derived regions. Among all variations, only 5 SNPs are in the protein-coding genes, including 3 synonymous SNPs of rps1, rps2, and rpl14 (cp-derived) and two non-synonymous SNPs of orf247-ct (S6 Table). Other 6 SNPs and 1 insertion are found in 5 tRNA genes, whereas the remaining 89 SNPs and 1 insertion are non-coding. Second, according to the variation types, there are 23 transitions (Ti) and 77 transversions (Tv), leading to a Ti/Tv ratio of 0.30. If we remove the cp-derived regions from the analysis, the ratio becomes 0.06 (Ti/Tv ratio; 3 Ti and 50 Tv SNPs). It is in sharp contrast to that of the nuclear genome, where the ratios range~2.0-2.1 in genome-wide and 3.0-3.3 in exonic sequences [61,62]. The Ti/Tv ratio in the coconut mt genome is much lower than what is in the nuclear genome, as well as a random prediction (0.5). It supports the observation that DNA replication and repair mechanisms are very different between mt and nuclear genomes. Third, we further scrutinized the data to exclude other possibilities that may lead to biased results. According to the Roche/454 and Illumina sequence coverage, there are~2x,~42x, and 235x of the Roche/454 reads, as well as~20x,~1788x, and~6000x of the Illumina reads for nuclear, mt, and cp DNA, respectively, which reflect a copy number ratio among them as 1:55:209 on average. This result indicates that only 1.79% reads of similar sequences may be an origin of the nuclear genome in the mt genome datasets, which can be excluded readily during sequence variation identification (alternative allele proportion> = 15%). Similarly, for the cp-derived regions, sequence variations are more likely from cp (79.17%) rather than from the nuclear or mt genomes.
Comparing to the two taxonomically closest species P. dactylifera and B. umbellatus in this study, we only aligned 54.45% and 14.15% of the C. nucifera mt genome, respectively, using bl2seq (S2 Fig) [63]. To further evaluate mt genome variations between the two palm species P. dactylifera and C. nucifera, we used MUMmer to compare the alignable regions and identified 2,442 SNPs and 1,122 indels, coming up with an average rate of 5 variations per 1,000bp (S3 Fig). RNA editing RNA editing is universal to almost all plant mt transcripts [64,65] with features of tissue specific and partial edits [66]. Different species have different RNA editing sites and the number of RNA editing sites ranges from 200 to 600 in angiosperm [67]. The public RNA-Seq data in NCBI are excellent and untapped resources, where we found 8 RNA-Seq datasets from coconut (two of Tall cultivars and 6 of Dwarf cultivars) for our RNA editing analysis [68]. To differentiate sequencing errors and SNPs from editing, we only kept the RNA editing sites with more than 2 supporting reads and with at least 50% edited reads. The criteria lead to the identification of 845 RNA editing sites in 56 protein-coding genes and 36 RNA editing sites are in the cp-derived regions (S7 Table). Among the total RNA editing sites (92 synonymous and  We also predicted 648 RNA editing sites using PREP-mt program in 45 genes. Comparing the RNA editing sites identified by using the two methods, we have 591 shared, 57 unique to PREP-mt program, and 212 unique to our method; the underestimation of PREP-mt program becomes obvious.

Gene expression analysis based on transcriptome data
Using the RNA-Seq datasets, we obtained mt transcriptomic profiles for the 8 samples (Fig 3  and Table 3). Three healthy leaf samples have the most abundant mapped reads (3.71% to 1.47%), two disease related leaf samples and embryogenic callus fall into the second abundance group (0.29% to 0.24%), whereas endosperm and embryo are the least abundant (0.12% and 0.05%, respectively). Read abundance of mt sequence coincides with tissue characteristics but read coverage shows a different pattern. First, root wilt disease susceptible (RWDS) leaf has the highest read coverage (71.92%) and coconut yellow decline (CYD) leaf has the lowest read coverage (34.94%). Second, healthy leaf samples (54.77% to 68.00%) and embryogenic callus (57.52%) have higher read coverage as opposed to embryo (37.34%) and endosperm (45.28%) ( Table 4).
There are 113 out of the total 145 genes expressed in at least two samples whereas only 3 genes (rpo, trna-UUA, and trnI-AAU) expressed in one sample (Young_leaf) (S8 Table). The number of expressed genes is consistent with read coverage. CYD leaf has the least expressed genes (92) as opposed to RWDS leaf that has the most (116). The genes petL and orf247-ct, which have stop codon in the middle of gene sequence, are highly expressed, however, we have not found any RNA editing sites to rescue the normal protein-coding function. Both of them need to be validated in future studies. All pseudogenes have relatively high expression level in all samples other than rpl10. According to transcriptomic profiles, we found 13 polycistronic transcripts among 37 genes (S9 Table). The conservative co-transcribed gene clusters rps12-nad3 and 5SrRNA-18SrRNA are also found in our mt genome.
According to the gene expression profiles (Fig 4), we have observed several obvious features. First, the genes can be divided into three categories according to expression intensity: highly, moderately, and lowly expressed. Second, among 33 highly expressed genes, there are only two tRNA genes (trnI-GAU and trnH-GUG). Third, three of the five rps19 copies are highly

Phylogenetic relationships
Our phylogenetic trees are built based on 31 mt protein-coding genes from 19 selected plants (8 monocots and 6 eudicots, as well as one each of gymnosperm, vascular plant, liverwort, hornwort, and moss; Fig 2). The maximum-likelihood (ML) tree has higher bootstrap values than the maximum parsimony (MP) tree except for the node of S. bicolor and Z. mays and the node between P. squarrosus and the group of M. polymorpha and P. patens. Most nodes have bootstrap values larger than 65% except for one node (49%) among Helianthus annuus, V. vinifera and Carica papaya and another node (58%) between P. squarrosus and the group of M. polymorpha and P. patens from ML method. Both methods have high bootstrap values (97%  Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome and 85%) for subgroup of C. nucifera, P. dactylifera and B. umbellatus in monocots. Previous studies indicate that date palm appears to be the most basal among monocots [24,69]. Moreover, date palm has certain miRNA families only found in eudicots [70]. Taken together, these results suggest that Arecaceae separated from the monocotyledon clade earlier than other plant families.

Conclusion
Despite the fact that the C. nucifera mt genome is as large as 678,653bp in length, we have assembled it using a variety of datasets and information, including all plant mt genome sequences, C. nucifera mt sequence datasets from different platforms and libraries with variable insert sizes, and specialized bioinformatics tools suitable for different purposes. The genome sequence variations and RNA editing sites based on transcriptomic data are all invaluable for further biological studies. Phylogenetic analysis indicates that Arecaceae separated from the rest of monocotyledons earlier in flowering plant evolution.