RNA-Seq Analysis of Cocos nucifera: Transcriptome Sequencing and De Novo Assembly for Subsequent Functional Genomics Approaches

Background Cocos nucifera (coconut), a member of the Arecaceae family, is an economically important woody palm grown in tropical regions. Despite its agronomic importance, previous germplasm assessment studies have relied solely on morphological and agronomical traits. Molecular biology techniques have been scarcely used in assessment of genetic resources and for improvement of important agronomic and quality traits in Cocos nucifera, mostly due to the absence of available sequence information. Methodology/Principal Findings To provide basic information for molecular breeding and further molecular biological analysis in Cocos nucifera, we applied RNA-seq technology and de novo assembly to gain a global overview of the Cocos nucifera transcriptome from mixed tissue samples. Using Illumina sequencing, we obtained 54.9 million short reads and conducted de novo assembly to obtain 57,304 unigenes with an average length of 752 base pairs. Sequence comparison between assembled unigenes and released cDNA sequences of Cocos nucifera and Elaeis guineensis indicated that the assembled sequences were of high quality. Approximately 99.9% of unigenes were novel compared to the released coconut EST sequences. Using BLASTX, 68.2% of unigenes were successfully annotated based on the Genbank non-redundant (Nr) protein database. The annotated unigenes were then further classified using the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Conclusions/Significance Our study provides a large quantity of novel genetic information for Cocos nucifera. This information will act as a valuable resource for further molecular genetic studies and breeding in coconut, as well as for isolation and characterization of functional genes involved in different biochemical pathways in this important tropical crop species.


Introduction
Cocos nucifera is an economically important crop species widely grown in tropical and subtropical regions. The coconut tree is highly versatile, with extensive applications in the fields of agriculture and industry: food, fiber, oil, soil fertilizers, spa ingredients, furniture, fashion accessories, garments, construction and building materials, oleochemicals and biofuels. Cocos nucifera is generally separated into two types based on morphological characteristics: tall and dwarf coconut. The types differ in life span, with 4-6 years required for inflorescence production in dwarf coconut and 8-10 years in tall coconut [1]. This long life-cycle is a major limiting factor in improving the yield and quality traits of Cocos nucifera through conventional breeding approaches. The discovery of novel genes and the development of molecular markers linked to quality and agronomic traits may speed up the pace in Cocos nucifera breeding. However, it is difficult to isolate functional genes which govern important quality and agronomic traits in coconut due to the scarcity of available genetic sequences. Currently, there are only 774 sequences available in the National Center for Biotechnology Information (NCBI) database originating from Cocos nucifera.
Transcriptome sequencing is an efficient methodology for large scale gene discovery [2]. However, transcript sequencing was previously primarily dependent on the construction of clone libraries [3]. With the development of new sequencing technologies, direct sequencing of cDNA fragments can now be accomplished free of cloning. Compared with earlier cloning methods, massively parallel sequencing of RNA (RNA-Seq) has a dramatically increased RNA sequencing output, as well as allowing global measurement of transcript abundance [4,5]. Hence, RNA-Seq is a positive driving force for characterizing the transcriptome and elucidating transcriptome complexity. Today, RNA-Seq is performed based on next-generation sequencing technology to generate millions of short cDNA reads. Following this, two strategies can be applied to construct transcripts using these short reads. The first strategy, proposed by Trapnell et al. [6] and Guttman et al. [7], is an 'align-thenassemble' approach: based on this method, the transcript can be reconstructed by aligning the short reads to the genome and then accounting for possible splice events. The second strategy is called 'assemble-then-align': de novo assembly is used to construct transcripts, and then the assembled transcripts are aligned to the genome to elucidate intron and exon structure and variations between alternatively spliced transcripts [8,9]. Moreover, de novo assembly is also applied to construct the transcriptome for these no-model organisms without genome sequences. In order to perform de novo assembly, very abundant short reads are required. RNA-seq has enhanced our understanding of transcriptomes in the animal and plant kingdoms, revealing novel transcriptome sequences, alternative splicing, transcript isoforms, new large intergenic noncoding RNAs and single nucleotide polymorphisms (SNPs) [10,11]. RNA-seq analysis provides a promising avenue for capturing genome-wide transcripts and splicing in unprecedented detail.
In this study, we used Illumina RNA-Seq technology to generate 54,931,406 short reads containing a total of 4,943,826,540 nucleotide bases. Due to the scarcity of reference sequences available in the National Center for Biotechnology Information (NCBI) and other databases, de novo assembly was applied to combine these short reads: 57,304 unigenes were obtained with an average length of 752 bp. These unigenes were aligned with sequence databases for subsequent Blast, mapping and annotation analyses. These results will provide a platform of sequence information for global discovery of novel functional genes in Cocos nucifera.

Plant Materials and RNA Isolation
Spear leaves, young leaves and fruit flesh were sampled from Hainan Tall cultivars, for total RNA isolation. High quality RNA was obtained using the MRIP (Methods for RNA Isolation from Palms) protocol [12], which contains an extraction buffer based on that proposed by Bilgin et al. [13], but modified for suitability in Palmaceae RNA extractions. RNA integrity was confirmed using agarose gel electrophoresis and a Agilent 2100 Bioanalyzer (Agilent Technologies). Isolated RNAs from different tissues were mixed for further processing. Beads with dT oligos were used to isolate poly (A) mRNA from the total RNA (Qiagen GmbH, Hilden, Germany).

Synthesis of cDNA and Subsequent Sequencing
The purified mRNA was fragmented with divalent cations under increased temperature. These short fragments were taken as templates to synthesize the first-strand cDNA using random hexamer primers and superscript TM III (Invitrogen TM , Carlsbad, CA, USA). Second-strand cDNA was then synthesized in a solution containing buffer, dNTP, RNaseH and DNA polymerase I and subsequently purified using a QiaQuick PCR extraction kit (Qiangen). EB buffer was used to resolve these short fragments for end reparation and poly (A) addition. The sequence adaptors were linked to two ends of short cDNA sequences and suitably sized cDNA fragments were selected out for PCR amplification based on the agrose gel electrophoresis results. Finally, the library established was sequenced with an Illumina Hiseq TM 2000. The paired-end library was developed according to the protocol of the Paired-End sample Preparation kit (Illumina, USA).

De novo Assembly
Transcriptome de novo assembly was carried out using the short read assembly program ''Trinity'', following the protocol documented in Grabherr et al. [14] and Xiao et al. [12].

Annotation and Classification of Unigenes
Unigenes were used for BLAST searches and annotation against the NCBI Nr database (NCBI non-redundant sequence database) using an E-value cut-off of 10 25 (E-value ,0.00001). Unigene sequences were also aligned by BLASTX to protein databases such as Swiss-Prot, KEGG and COG, in order to retrieve proteins with the highest sequence similarity to the given unigenes along with putative functional annotations. If results of different databases conflicted, Nr then Swiss-prot database results were given precedence. For unigenes that did not align to any of the above databases, ESTScan software [15] was used to predict their coding regions and determine sequence direction. Unigenes aligned to databases with higher priority will not be aligned to lower priority database. The alignments were considered complete when all four assignments were finished. Coding sequence regions were then determined for the highest-ranked proteins using BLAST. Unigenes that could not be aligned to any database were scanned by ESTScan [15] to determine the nucleotide (59-39) and amino acid sequences of the coding regions.
The Blast2GO program was used to obtain GO annotations for the unigenes, as well as for the KEGG and COG analyses [16]. The WEGO software was then used to perform GO functional classification of all unigenes, in order to view the distribution of gene functions within coconut at the macro level [17]. This analysis mapped all of the annotated unigenes to GO terms in the database and calculated the number of unigenes associated with every GO term. COG and KEGG pathway annotations were performed using Blastall software against the COG and KEGG databases.

Sequencing and Assembly Quality Statistics
To globally elucidate the transcriptome of Cocos nucifera, RNA samples from different tissues were mixed in equal proportions and then used for mRNA preparation, fragmentation and cDNA synthesis. Two independent Illumina sequencing runs generated a total of 54,931 406 short sequence reads consisting of 4,943,826,540 nucleotides (nt) total, with an average length of 90 bp for each short read (Table 1). Due to the absence of reference genomic sequences, de novo assembly was applied to construct transcripts from these RNA-seq reads. De novo assembly was performed using Trinity (insert reference or company here), a de novo assembler of RNA-seq. 127,952 contigs were constructed from the raw sequence reads. As shown in Figure 1A, the sequence length of these assembled contigs ranged from 100 bp to more than 3000 bp, with an average length of 344 bp. The number of contigs decreased with increasing contig length (Table 1, Figure 1A).
Using paired-end reads, contigs from the same transcript were connected and extended in both ends to get the longest unigenes possible. In total 57,304 unigenes were generated, with an average unigene length of 752 bp and ranging in size from 200 bp to more than 3000 bp (Table 1 and figure 1B). These assembled unigenes have been deposited in the NCBI database.

Comparison of Assembled Unigenes with EST Sequences from Cocos Nucifera and Elaeis Guineensis Available in the NCBI Database
To estimate the accuracy of the de novo assembly performed in the research, the assembled sequences (57 304 unigenes) were blasted against available Cocos nucifera cDNA sequences downloaded from the NCBI database in February 2012. These comprised 115 ESTs (JG390663-JG390776, AF241736, AM259062-AM259066) generated from water stressed plantlets. Due to the limited number of EST sequences available for Cocos nucifera, all available EST sequences from related species Elaeis guineensis (41,977 reads) were also downloaded from the NCBI database for sequence alignments with the 57,304 unigenes. Using a cut-off E-value of 10 25 , 46 unigenes (0.1%) could be aligned to the downloaded Cocos nucifera cDNA sequences (Figure 2A), and 31% of the unigenes could be aligned to the Elaeis guineensis cDNA sequences ( Figure 2B). Only 47% of the NCBI EST sequences (all from water-stressed plantlets) of Cocos nucifera could be matched with unigenes ( Figure 2C), but 79% of the EST sequences from Elaeis guineensis could be matched with unigenes ( Figure 2D).
A high degree of similarity between the assembled unigenes and the available EST sequences of Cocos nucifera and Elaeis guineensis was detected. In Cocos nucifera and Elaeis guineensis, 83% and 88% of the matched sequences respectively had an E-value lower than 10 251 ( Figure 2E, F), and 91% and 83% had alignment identities higher than 90%, supporting the reliability of the de novo assembly performed in the study ( Figure 2G, H).  In order to estimate the coverage of the coconut transcriptome over all genes in the species, we downloaded the whole genome sequence of Phoenix dactylifera (a relative of Cocos nucifera) from the website (http://qatar-weill.cornell.edu/research/ datapalmGenome/download.html). There are 28,889 genes predicted in the Phoenix dactylifera genome [18]. Using a cut-off E-value of 10 25 , 20,541 (71.1%) of genes predicted from the genome sequence matched to the coconut transcriptome (Table S1), which suggests that the coconut transcriptome may have high coverage over all genes in the genome.

Annotation and Classification of Cocos Nnucifera Unigenes
To annotate the Cocos nucifera transcriptome, unigene sequences were first searched against the NCBI non-redundant (Nr) protein database using a cut-off E-value of 10 25 . Out of 57,304 unigenes, 39 109 unigenes (68%) were matched to known protein sequences in the Nr database (  (Table 3), Zea mays, Oryza sativa (model system for monocot plants) and Arabidopsis lyrata, Cocos nucifera unigenes had the highest identities with the Oryza sativa protein database (Table 3). At a cut-off E-value of 1e 25 , 34,708 (61%) of Cocos nucifera unigenes had matches to the Oryza sativa protein database, 59% to the Arabidopsis thaliana database, 34% to the Zea mays database and 30% to the Arabidopsis lyrata database. We also found that 17,886 (31%) of unigenes had matches to the Oryza sativa protein database with an E-value ,1e 250 . At this E-value 22%, 16% and 14% of unigenes matched to the Arabidopsis thaliana, Zea mays and Arabidopsis lyrata databases respectively.
Gene Ontology (GO) terms were subsequently assigned to Cocos nucifera unigenes based on their sequence matches to known protein sequences in the Nr database. A total of 15,178 unigenes (26.50%) were assigned at least one GO term (Table 2), amongst which 8,022 (14.00%) were in the biological process category, 10,986 (19.17%) were in the cellular component category and 9,532 (16.63%) were in the molecular function category. Based on GO annotation, Cocos nucifera unigenes were categorized into 44 groups using a set of plant-specific GO categories. Metabolic processes (5,538 unigenes, 9.66%), cell (10,874 unigenes, 18.98%) and catalytic activity (6,531 unigenes, 11.40%) were the most abundant GO slims in each of the biological processes, cellular component localization and molecular functionality categories respectively (Figure 3). The high number of unigenes putatively involved in metabolic processes (5,538) and cellular processes (5,285) in the biological processes category indicated that the Cocos nucifera tissues used in the study were undergoing extensive metabolic activities. Also, a large number of expressed genes were involved in biological regulation (1,522), developmental processes (1,087), establishment of location (1,341), location (1,485), response to stimulus (1,751), cell part (9,905), organelle (8,265), and binding (6,090), while only a few unigenes were clustered in terms of adhesion, cell killing, locomotion, nitrogen utilization, pigmentation, rhythmic process, viral reproduction, cell junction, extracellular region part, virion, and antioxidant activity. Figure 4 shows the GO classification of coconut transcriptome from RNA mixture of leaves and fruits.
To further identify the active biochemical pathways in the leaves and fruit of Cocos nucifera, we mapped the Cocos nucifera unigenes to the reference canonical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG). KEGG is thought to provide a basic platform for systematic analysis of gene function in terms of the network of gene products. A total of 23,168 unigenes were annotated based on a BLASTX search of the KEGG database (Table 2): 215 biosynthesis pathways were predicted. Of these 215 KEGG pathways, the metabolic pathway was the largest, containing 6,356 members (unigene products).

Genes Potentially Involved in Fatty Acid Biosynthesis and Metabolism
As well as being a fruit and food crop, Cocos nucifera is an important tropical oil crop. The oil content of coconut flesh is generally between 64% and 75% dry weight, of which more than 50% is lauric acid. Hence, we further characterized the assembled unigenes with regards to biochemical pathways related to fatty acid biosynthesis and metabolism: 347 unigenes were classified as being involved in five steps of the fatty acid biosynthesis and metabolism pathway (fatty acid biosynthesis, unsaturated fatty acid, citrate cycle, fatty acid metabolism and fatty acid elongation) ( Table 4). Of the 347 unigenes, 121 were involved in fatty acid biosynthesis, and then followed by fatty acid metabolism (94), fatty acid biosynthesis (81), unsaturated fatty acid (41) and fatty acid elongation.

Discussion
Coconut palms can provide almost all the necessities of life in the tropical regions, such as food, drink, oil, etc. However, due to the long life cycle of the coconut palm and the lack of genetic sequence information, efforts to improve the agronomic characteristics of Cocos nucifera through molecular breeding have made little progress. Transcriptome sequencing can provide abundant sequence information, as well as shedding some light on the basic biological processes occurring in cells [19,20]. With the advent of next-generation sequencing technology, transcriptome analysis has been widely applied to many different species [21,22]. However, there is still no available transcriptome sequencing data for coconut. In this study, we utilized RNA-seq technology to sequence the coconut transcriptome, and discovered a large number of novel genes in Cocos nucifera. A total of 57,304 unigenes were obtained, representing a comprehensive transcriptome of Cocos nucifera. Functional annotation showed that these unigenes covered every basic biological process, and 23,168 of these unigenes were also mapped into 215 KEGG pathways. We provide comprehensive sequence resources for coconut for the first time, which will benefit subsequent functional genomics and discovery of novel genes in Cocos nucifera.
In this study, Illumina sequencing produced 54,931,406 short reads (average 90 bp each short read), totaling 4.9 Gb of sequence. Generally, there are two alternative computational methodologies for establishing transcriptomes based on these short reads: the mapping-first approach, which is carried out based on an available reference genome, and the assembly-first (de novo) approach. Due to absence of reference sequences for Cocos nucifera, de novo assembly was applied to connect these short reads to form longer expressed sequences. A number of assembly software programs are available for this purpose, such as ABySS [23], SOAPdenovo [24,25] and Oases [26]. Previous whole transcriptome assemblies in plants have been done using SOAPdenovo software, a whole-genome assembler with utility for assembling short reads generated by Illumina sequencing. The accuracy of the de novo assembly performed using SOAPdenovo software is strongly dependent on the user-defined parameter of overlap sequence length (k-mer), meaning the overlap required between two short reads in order to consider them contiguous. Wang et al. [27] used 21-mers to assemble the transcriptome of whitefly, and obtained an average unigenes length of 226 bp using this parameter. When k-mer = 23 was used to connect short sequence reads generated by Illumina in another study, 74,336 assembled transcript sequences were obtained, with an average of 439.5 bp in sequence length. Xia et al. [28] applied a more rigorous k-mer parameter (k = 29) to assemble trascriptome fragments, and obtained an average unigenes length of 436 bp. In our research, de novo assembly was carried out by using Trinity, a de novo assembler of RNA-seq. Compared to other de novo assemblers of RNA-seq, Trinity software is thought to be more suitable for constructing a de novo transcriptome without a reference genome, as it utilizes the kmer graph method to assemble Illumina RNAseq reads [14]. Using Trinity software, 54,931,406 short reads were assembled into contigs, which were subsequently connected into unigenes. The average sequence length of the unigenes was 752 bp, which is longer than previously documented sequence lengths using other de novo assembler programs [29,30,31]. In addition, the sequence length of 13,795 (24.1%) unigenes in our study was more than 1000 bp. Hence, our work provides some full-length RNA-seq data (putative complete genes), which will be advantageous for further gene function validation and functional genomics approaches in Cocos nucifera.
It is essential to evaluate whether unigene assemblies are reliable, particularly in de novo sequencing. Xia et al. [28] assembled Illumina short reads of Hevea brasiliensis, and matched the assembled unigenes to known H. brasiliensis cDNA sequences to confirm the accuracy of the de novo assembly: 31% of matches had an E-value lower than 10 250 . In our study, due to the scarcity of available cDNA sequences from Cocos nucifera, the assembled unigenes were separately matched with EST sequences from both Cocos nucifera and Elaeis guineensis: 83% and 88% of matched sequences for Cocos nucifera and Elaeis guineensis respectively had an E-value lower than 10 251 . This high degree of sequence match similarity indicated that the de novo assembly we performed using the Trinity software was very accurate.
Out of 57,304 unigenes, 39,109 (68.2%) were matched against the NCBI non-redundant (Nr) protein database, allowing further functional annotation and classification using GO, COG and KEGG. This functional annotation provided predicted information for biological function and biosynthesis pathways for the assembled unigenes [32,33]. Although functional annotation and classification provide predicted functions for these coconut unigenes, ongoing studies are still required for further function validation. Meanwhile, we found that 34,708 (60.57%) of Cocos nucifera unigenes had matches in the Oryza sativa protein database, which is the highest level of similarity across all the model plant databases. This result is in good accordance with the sequence alignment between rice and data palm, another Palmaceae species, made by Al-Dous [18]. This suggests the Cocos nucifera genome is more closely related to the Oryza sativa genome than to other model plant genomes. Thus, the sequence database of Oryza sativa should be considered as a reference for molecular biology research in Cocos nucifera.
Coconut is an important tropical oil crop, providing edible fatty acid for millions of people. Compared with other oil-crops, coconut oil contains a large proportion of lauric acid (more than 50%), a twelve carbon saturated fatty acid which is almost undetectable in rapeseed, peanut and soybean. Previous research has investigated phenotypic variation for oil content and fatty acid composition in Cocos nucifera [34,35]. However, the molecular basis of lauric acid accumulation and its regulation are still unclear. In this study, we identified 347 unigenes involved in the biosynthesis and metabolism of fatty acid. These unigenes could be assigned to five steps of the fatty acid biosynthesis pathway, providing a means of elucidating the molecular mechanisms for fatty acid biosynthesis in coconut palm. It is noteworthy that 20 unigenes were predicted to be related to fatty acyl-ACP thioesterase, which is a crucial enzyme for terminating the elongation of carbon chains and therefore regulating the length of fatty acids. Therefore, it is possible that the expression of fatty acyl-ACP thioesterase is correlated with the observed accumulation of medium chain fatty acids (i.e. lauric acid) in Cocos nucifera.
Non-coding RNA (ncRNA) genes can produce functional RNA molecules, rather than encoding proteins [36]. With the technological breakthrough of deep-sequencing, large-scale genome sequencing combined with RNA-seq studies have led to the discovery of numerous non-coding RNAs (ncRNA). Recently, ncRNA has been shown to act in transcriptional and posttranscriptional regulation [37]. In addition to the regulation of gene expression, ncRNA may also play a positive role in maintaining genome stability [38]. Due to the lack of available genome sequence in coconut, we compared the coconut transcriptome to the whole genome sequence of date palm (a species related to coconut). At an E-value cut-off 10 250 , 4,433 unigenes have matches to the intergenic sequences of the date palm genome. Hence, some unigenes are putative candidates for non-coding RNA in Cocos nucifera. However, only 10 unigenes of the 4,423 matching to intergenic sequences could be matched to plant databases of non-coding RNA. Subsequent studies would be required to validate the putative non-coding RNA and to ascertain the biological functionality of these RNA sequences.

Conclusions
We applied Illumina next-generation sequencing and de novo assembly to elucidate the coconut transcriptome for the first time. In total, 57,304 unigenes were obtained, with an average unigene length of 752 bp. Of these unigenes, 99.9% were novel compared to released coconut EST sequences. Of the 57,304 unigenes, 23,168 were mapped into 215 KEGG pathways, including galactose metabolism, plant-pathogen interaction and plant hormone signal transduction pathways. In addition, we also found that 347 unigenes were involved in fatty acid synthesis and metabolism. These unigenes were classified into five steps of the fatty acid biosynthesis and metabolism pathway (fatty acid biosynthesis, unsaturated fatty acid, citrate cycle, fatty acid metabolism and fatty acid elongation). The computational prediction of unigene function will provide primary information for genes involved in different biological processes, allowing these unigenes to be treated as candidate genes in future research.
Overall it can be concluded that the RNA-seq analysis and de novo assembly provided a global view of the Cocos nucifera transcriptome. The availability of this annotated transcriptome will be a step forward for the isolation and characterization of functional genes involved in different biochemical pathways, as well as paving the way for molecular genetic approaches to Cocos nucifera breeding.