De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) Using RNA-Seq and Discovery of Whole-Genome Duplications in the Ancestor of Palms

Nypa fruticans (Arecaceae) is the only monocot species of true mangroves. This species represents the earliest mangrove fossil recorded. How N. fruticans adapts to the harsh and unstable intertidal zone is an interesting question. However, the 60 gene segments deposited in NCBI are insufficient for solving this question. In this study, we sequenced, assembled and annotated the transcriptome of N. fruticans using next-generation sequencing technology. A total of 19,918,800 clean paired-end reads were de novo assembled into 45,368 unigenes with a N50 length of 1,096 bp. A total of 41.35% unigenes were functionally annotated using Blast2GO. Many genes annotated to “response to stress” and 15 putative positively selected genes were identified. Simple sequence repeats were identified and compared with other palms. The divergence time between N. fruticans and other palms was estimated at 75 million years ago using the genomic data, which is consistent with the fossil record. After calculating the synonymous substitution rate between paralogs, we found that two whole-genome duplication events were shared by N. fruticans and other palms. These duplication events provided a large amount of raw material for the more than 2,000 later speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general.


Introduction
Mangroves belong to different families or orders but have adapted to the same unstable environment of intertidal zones [1]. Among the approximately 70 mangrove species, the "mangrove palm" Nypa fruticans (Thunb.) Wurmb. is the only monocot species [1,2]. This species is the sole species of genus Nypa and a basal species in the family Arecaceae (exclude subfamily platform. We obtained 24,607,427 raw paired-end reads and deposited them in the NCBI Short Read Archive (SRA) with the accession number SRR2016566. After filtering low quality reads, 19,918,800 clean reads were used and assembled into 51,702 contigs using Trinity [15]. Trinity includes three independent modules (Inchworm, Chrysalis and Butterfly) to assemble contigs step by step. It was widely used in the recent transcriptome analyses [10,16,17] because of its better performance in the quality of assembles compared to other computer programs [18]. We removed redundancy in these contigs, treated the remaining 45,368 contigs as unigenes and deposited them in NCBI GenBank under the accession number GDFT01000000. After mapping short reads to the unigenes, we calculated and plotted the mean coverage of each unigene (i.e., the average sequencing depth for all sites in a gene, Fig 1A). It showed that all unigenes were among the normal range compared to other studies [10,16,17]. The mean, median length and N50 value of the unigenes were 722 bp, 460 bp and 1,096 bp, respectively ( Table 1). The N50 value indicates that half of the entire assembly is contained in sequences equal to or larger than this value. The high value of N50 indicates a high quality assembly. The N50 value of unigene was comparable to other studies (e.g. 956 bp of Orchis italic [16], 1,219 bp of Cocos nucifera [10] and 844 bp of Ammopiptanthus mongolicus [17]), indicating the good quality of the assembly. Total 21,220 unigenes were longer than 500 bp and were well assembled ( Fig 1B). The GC content of N. fruticans was 47.7%, which was very close to that for the date palm P. dactylifera (47.6%) [19].

Annotation of the coding sequences
To obtain the functional annotation of each unigene, we firstly performed BLASTX against the NCBI non-redundancy database; 32,260 (71.11%) unigenes were matched (Table 1, S1 Table). Protein signature recognition was performed via InterProScan (http://www.ebi.ac.uk/interpro/), and a total of 24,650 (54.33%) unigenes had results. After merging the results of BLASTX with those of InterProScan, 18,761 (41.35%) unigenes were well-annotated via the analysis pipeline of Blast2GO (Table 1, S2 Table).
Gene ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes. We retrieved GO annotations for 18,761 unigenes from the GO database and analyzed the enrichment of level-2 GO terms in three categories: cellular De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) component, molecular function and biological process (Fig 2). For the cellular component category, 'cell,' 'cell part' and 'organelle' were the most abundant. For molecular function, 'catalytic activity' and 'binding' were overrepresented, and for the biological process, the two most highly represented categories were 'metabolic process' and 'cellular process.' In biological process, there were 1,198 unigenes annotated as "response to stress" (GO: 0006950). Among these, 131 unigenes were assigned to "response to osmotic stress" (GO: 0006970), 315 unigenes to "response to DNA damage stimulus" (GO: 0006974), and 178 unigenes to "response to oxidative stress" (GO: 0006979), all of which may be important for N. fruticans' adaptation to an environment with high salinity, oxidation and UV light (S3 Table). Genes about "response to osmotic stress" were important for responding the increase or decrease of the concentration of solutes outside the organism or cell, which were helpful for Nypa to adapt to the high saline environment. Genes about "response to oxidative stress" would be important to reduce reactive oxygen species (ROS) and cell damage. When plants are subjected to stresses like high salinity, drought, high UV and flood, their balance between the productions of ROS and the quenching activity of antioxidants is upset, often resulting in oxidative damage. Genes about "response to DNA damage stimulus" would be related to dealing with the high UV conditions in the intertidal zone.
The pathway annotation was conducted via Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. The KEGG database contains data from a systematic analysis of inner-cell metabolic pathways and functions of gene products. Pathway-based analysis is helpful for understanding the biological functions and interactions of genes. In total, 2,956 unigenes were assigned to 134 pathways. The most enriched pathways were "Purine metabolism" (1,057

Detecting positively selected genes
Understanding the adaptive evolution of species, especially non-model woody plants such as mangroves, has long been limited by the lack of genomic data. Despite their important ecological and evolutionary value, evidence for positive selection among mangroves and palms is scarce. Under positive selection, the frequency of advantageous allele increases because the individuals with this allele get a better phenotype to survival and reproduce more offspring than others. The positively selected genes are more likely fixed in the species. To detect the candidate positively selected genes, sequences of other species can be used to compare with the target species. Hence, we introduced the published genome sequences of P. dactylifera, E. guineensis and O. sativa to compare to the coding sequences of N. fruticans by the improved branch-site model. This model used the likelihood method to detect the candidate positively selected sites in the target species [20]. Among 2,136 single-copy orthologous genes, 15 putative positively selected genes were identified in Table 2. Four genes are of special interest. The orthologous gene of Nfr_c21523_g1_i2 in rice has a zinc finger domain, and its orthologous gene in Arabidopsis thaliana is involved in salinity stress [21]. The orthologous gene of Nfr_c19525_g1_i1 was also involved in response to salt stress [22]. These genes should be important for the adaptation of N. fruticans to a high-saline environment. The other two genes may play important roles in the formation of vivipary in N. fruticans. The embryo of viviparous plants grows sufficiently to emerge visibly from within the seed tissue before dispersal. The formation of vivipary are related to the speed of embryo development, the regulation of seed dormancy and the mechanism of seed detaching from the parent. The Arabidopsis ortholog of Nfr_c19922_g2_i1 is essential for embryo development [23,24], while the Arabidopsis ortholog of Nfr_c20352_g2_i1 (the hydroxysteroid dehydrogenase HSD1) was related to the lipid biosynthesis in seeds [25,26]. The seed dormancy was terminated when HSD1 was overexpressed [26,27]. These two genes about embryo development and seed dormancy may be related to the special feature vivipary. Since the two genes are related to other functions not only in embryo [27,28], they also expressed in leaves with a normal expression level (the mean coverage of the two genes are 17.8 and 42.4, respectively).
Salt tolerance and vivipary germination are the most significant and fundamental adaptations for true mangroves to successfully occupy intertidal zones. Although N. fruticans and the compared palm species both originated in tropical and sub-tropical areas, they inhabit different ecological habitats-intertidal and terrestrial zones, respectively. Positive selection on these salinity stress and vivipary seeding-related genes may have contributed to woody plant adaptations to coastal environments.

Identification of simple sequence repeats
Simple sequence repeats (SSRs), also termed microsatellites, are nucleotide motif containing tandem repeat of two to five or six base pairs in DNA sequences. They are ubiquitous found in both protein coding and non-coding regions. SSRs are important molecular markers used in gene mapping, molecular breeding, genetic diversity, and discrimination. Based on the 45,382 unigenes, a total of 7,958 SSRs were identified in 6,541 unigenes (S5 Table). Approximately 14.4% transcriptomic sequences possess SSR loci (S5 Table). The mean SSR density was one per 4,122 bp. The most abundant type of repeat motif was di-nucleotide (55.74%), followed by tri-nucleotide (41.76%) and tetra-nucleotide (2.19%, Table 1). The percentages of penta-nucleotide and hexa-nucleotide were very low (0.13% and 0.19%, respectively). To further compare the motif frequency between different palms, we also detected the SSR motifs of three other palms (date palm P. dactylifera, oil palm E. guineensis and coconut C. nucifera) using their To validate the predicted SSRs, total 36 pairs of PCR primers for the randomly picked SSRs were designed and used for PCR. Twenty-five of them successfully got the PCR products. We sequenced these 25 PCR products using Sanger sequencing. All sequencing results were the same as the genes assembled by Trinity. The predicted SSRs were 100% validated (S6 Table). The total 25 sequenced gene segments are all homozygote, suggesting the low heterozygosity of N. fruticans.

Phylogenetic tree and divergence time
Nypa represents an independent line of specialization within the palm family and has no discernable close relatives due to its numerous distinctive characteristics [1]. Nypa has even been suggested as a distinct family (Nypaceae) [1]. To perform a phylogenetic analysis, we collected and downloaded the genomes and transcriptomes of four other species: the three sequenced palms (P. dactylifera, E. guineensis, C, nucifera) and Oryza sativa as an out-group species. A total of 22,168 ortholog groups were generated using OrthoMCL software [29], and 1,347 single-copy orthologs were retrieved to reconstruct a phylogenetic tree and calculate the divergence time. The phylogenetic tree was drawn with the best-fit model (Fig 4A). The phylogenetic tree was 100% supported by 1,000 bootstraps analysis. It showed that N. fruticans was the species divergent from the ancestor of other three palms, which was consist with previous studies. The branch lengths of palms were much shorter than that of O. sativa, suggesting the mutation rate became smaller in the lineage of palms. The divergence time between N. fruticans and other palms was estimated at approximately 75.5 Mya (million years ago) (Fig 4B).
The paleobiogeography of Nypa has long been intensively investigated because it is one of the oldest palms and one of the first monocots in the fossil plant record [2,5]. Our estimation of its divergence time from other palm relatives, as well as its first appearance in the form of pollen in the Campanian (Late Cretaceous, 72.1-83.6 Mya [5]), indicate the fact that this genus firstly appeared at least in Late Cretaceous. Nypa, as well as Acrostichum, Pelliceria and now extinct species Brevitricolpites variabilis, are the first members found to inhabit the brackish water environment [2]. The earliest unequivocal fossil palm materials probably dated from the Early to Mid-Late Cretaceous, and the expanding and retracting distribution of fossils in higher latitudes have been important indicators of global changes [6]. The large number of fossils of N. fruticans showed maximum expansion during the Early Eocene, perhaps corresponding with the climate optimum. However, the cooling temperature and seasonal climate reduced its cosmopolitan range. This species made its last appearance in the New World in the Late Eocene and it is currently restricted to the Old World [2].

Whole-genome duplication events
Whole-genome duplication events could provide functional and evolutionary materials for speciation and adaptation. A previous study found two genome-wide duplication events of the date palm P. dactylifera [8]. Whether N. fruticans share these two WGD events is an interesting question. The distribution of synonymous substitution rate (number of synonymous nucleotide substitution per synonymous sites, Ks) of paralogs could be used for identification of WGD event. The larger Ks, the longer divergence time between paralogs. For a species that experienced a WGD event, most genomic sequences were doubled at the same time. The mutations were then accumulated between the paralogs introduced by WGD. The peak of Ks distribution would be a good indicator of WGD event, since it state a large number of genes duplicated in a narrow time range.
In Fig 5A, the two peaks (I and II) of the Ks distribution of P. dactylifera (blue bars) showed two WGD events, as is consistent with the descriptions in Al-Mssallem et al. [8]. The red bars indicated Ks values of the paralogs of N. fruticans, suggesting that the two WGD events occurred near the same time of P. dactylifera despite the fact that peak I of N. fruticans was weaker than that of P. dactylifera. Peak I shows that the generally accepted duplication event of flowering plants happened approximately 150 Mya [30], whereas peak II on the left suggested a recent WGD event. Peak III of the Ks of orthologs between the two palms (yellow bars) is to the left of the peak II, suggesting that the two palms shared the same two WGD events and diverged shortly after the recent duplication event. To investigate the WGDs in other palms, we calculated and showed the Ks distributions of E. guineensis and C. nucifera (Fig 5B). E. guineensis (green bars) showed two WGD events as P. dactylifera; the two peaks were slightly right-shifted compared with those of P. dactylifera. This may have been due to the relatively high substitution rate and could be consistent with the longer branch of E. guineensis in Fig 4A. The Ks distribution of C. nucifera produced peak II, whereas peak I was too weak to detect. The transcriptome redundancy of C. nucifera may contribute to the leftmost peak being near zero. In summary, we detected two WGD events-an ancient one (approximately 150 Mya) and a recent one (prior to the divergence of palms, which likely occurred in the common ancestor of all palms) based on the available resources.
If the ancient WGD event was dated to approximately 150 Mya and the synonymous substitution rate has remained constant, the divergence time of the two palms should be approximately 40 Mya, which differs from our divergence estimation and fossil records. Hence, the mutation rate of palms had been decreasing during their evolution, which may be associated with a shift to the tree/shrub habit [31]. The short branch lengths of the four palms also support this conclusion (Fig 4A).

Conclusion
N. fruticans is the only monocot species that has adapted to intertidal zones. In this study, we provide a well-annotated transcriptome of N. fruticans with 45,368 unigenes, of which 71.11% had BLASTX hits and 41.35% were annotated in the Blast2GO pipeline. Several positively selected genes related to salt tolerance and vivipary development were detected. These genes may be important for its adaptation. We also detected two WGD events that are shared among four palms, and the divergence of these palms occurred shortly after the recent WGD event. Because N. fruticans is the basal species in the family Arecaceae (exclude subfamily Calamoideae), these two WGD events likely occurred in the common ancestor of all palms, which provided a large amount of raw material for the following more than 2,000 speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general.

Ethics statement
All of the necessary permits for field studies were obtained. The Hainan Dongzhaigang National Nature Reserve Authority provided permission to collect the samples for our research. The distribution of synonymous substitution rate (Ks) of the paralogs is displayed in red and blue for N. fruticans and P. dactylifera, respectively. Peaks I and II indicated two whole-genome duplication events in both species. The yellow bars showed the distribution of Ks of the orthologs between the two species. Peak III indicated that the divergence time of the two palms was close to the recent duplication events (peak II). (B) The distribution of Ks of paralogs of E. guineensis and C. nucifera. The same two whole-genome duplication events can be found for the two species. Plant materials and RNA sequencing N. fruticans used in this study was collected from Dongzhai Harbor, Haikou, Hainan, China (19°57'12" N; 110°33'59" E). Since the main object of the study is not to compare the expression level of genes but to generate a large number of nucleotide sequences of N. fruticans to do the following phylogenetic and other analysis, only one transcriptome was sequenced. Total RNA was isolated from the fresh young leaves of one individual using the modified CTAB method and precipitated with 5 M LiCl at -20°C overnight. The resulting RNA pellets were suspended in 70 μL DEPC-treated water. The quantified total RNA (concentration > 500 ng/μL; rRNA ratio = 1.2) was delivered to the Beijing Genome Institute (Shenzhen, China) for further treatment. The mRNAs were extracted from the total RNAs using Oligotex TM -dT30 (TaKaRa, Dalian, China) and then fragmented ultrasonically and converted to double-stranded cDNAs using random primers. A nucleotide "A" was added at the 3'-end of cDNAs, and then adapters were ligated to both ends. The purifying cDNAs were sequenced via HiSeq 2000 (Illumina Inc., San Diego, CA, USA) with a read length of 90 bp and insertion size of 200 bp.

De novo assembly and annotation
The raw reads were filtered using SolexaQA [32] with the following steps: the raw reads were trimmed via DynamicTrim (quality threshold = 20) and then filtered using LengthSort (length cutoff = 50 bp). The clean reads were de novo assembled into contigs using Trinity [15] with the default parameters except for "min_kmer_cov = 2". We removed the redundancy of the assembled contigs using TGICL [33] and CD-HIT [34]. The clean reads were then mapped to the retained contigs using bowtie2 [35], and their mean coverage was calculated. Contigs with a mean coverage of less than two were filtered out. The remaining contigs were treated as unigenes.
To obtain the functional annotation of N. fruticans, the unigenes were firstly searched against the NCBI non-redundancy (NR) protein database using BLASTX with an e-value threshold of 10 −6 ; the results were then evaluated using Blast2GO (v.3.0.0 PRO) [36] for the downstream analyses. The distribution of level-2 Gene Ontology (GO) terms of the annotated unigenes was plotted using WEGO for biological processes, molecular functions and cellular components categories [37].

SSRs identification and validation
We used MISA (MIcroSAtellite identification tool, http://pgrc.ipk-gatersleben.de/misa/) to identify the simple sequence repeat (SSR). The SSRs were considered to contain motifs with two to six nucleotides and a minimum of five contiguous repeat units.
To validate the predicted SSRs, we used Primer3 software [38] to design primers around the SSRs. The key parameters set for primer design were as follows: primer length 18-24 bp, PCR product size 100-280 bp, optimum annealing temperature 55-60°C and GC content 40%-65% with 50% as the optimum. We random picked 36 pairs of them to do the PCR experiment. PCR were set up in a 30μl reaction mixture in PCR tubes. Each reaction mixture contained 0.4 μM each forward and reverse primer, 0.4 mM each dNTP, 1.5 U LA Taq DNA polymerase and 3μl 10x LA Taq buffer (TaKaRa, Japan) and 50-100 ng of genomic DNA as a template. PCR amplification was carried out using a Dyad Disciple thermal Cycler (Bio-Rad, Beijing, USA) with the following cycling conditions: pre-denaturation at 94°C for 5 min followed by 33 cycles of 94°C for 30 sec, 57°C for 45 sec and 72°C for 1 min 30 sec, and finally, 10 min at 72°C for extension. Twenty-five of them successfully got the PCR products and were sequenced by Sanger sequencing.

Identification of candidate positively selected genes
We combined the genome sequences of P. dactylifera, E. guineensis and O. sativa to cluster gene families using the OrthoMCL software [29] and obtained 2,136 single-copy genes. We then detected positively selected genes using CODEML in the PAML 4.8 package [39]. Setting N. fruticans as the foreground branch, we used the improved branch-site model to calculate the likelihood of the null model (no sites in the foreground were positively selected) and alternative model (existing sites in the foreground were positively selected) to obtain the likelihood ratio. To remove false discoveries, the Benjamini-Hochberg correction for multiple testing was used (FDR < 0.05). Positively selected genes were also annotated based on the annotation of the orthologs of O. sativa (MSU Rice Genome Annotation Project). The orthologs of A. thaliana were adopted from the orthologous groups of O. sativa in MSU Rice Genome Annotation Project (http://rice.plantbiology.msu.edu) and The Arabidopsis Information Resource (https:// www.arabidopsis.org). For the genes with no record in the databases, we used Blastx with an evalue of 10 −6 to find the orthologs.

Identification of ortholog groups and divergence time estimation
The OrthoMCL software was used to cluster the ortholog groups of five monocots, including four palms (N. fruticans, P. dactylifera, E. guineensis, C. nucifera) and the out-group species Oryza sativa [40]. We firstly extracted the coding region sequences (CDS) of these unigenes based on the BLASTX results and then translated CDS to proteins using a Bioperl script. Allversus-all BLASTP was conducted for all protein sequences with a cutoff of 10 −10 , and hits with identities of less than 40% were removed. The results were fed to OrthoMCL software to obtain ortholog groups with default settings. Protein sequences of each single-copy ortholog from the OrthoMCL results were retrieved and aligned using MUSCLE [41]. The protein alignments were then converted to nucleotide alignments using pal2nal [42]. Any alignment shorter than 150 bp was removed from the analysis.
To reconstruct the phylogenetic tree, we first performed a model test using jmodeltest2 [43]. The best-fit model was GTR+G. The phylogenetic tree was then reconstructed using PhyML3.0 [44] with the best-fit model and 1,000 bootstraps. We estimated the divergence time of each node using MCMCTREE of the PAML 4.8 package [39]. The time constraints were set as follows: 1) the divergence time between N. fruticans and O. sativa was set to 117-128 Mya because of the stem node age of Araceae and Poales [45]; 2) the divergence time of P. dactylifera and E. guineensis was set to 60-70 Mya [9].

Detection of whole-genome duplication events
The protein sequences of N. fruticans and C. nucifera were used to conduct BLASTP against itself. Then, the OrthoMCL software was used to obtain the paralogs of N. fruticans. For the published genomes of P. dactylifera and E. guineensis, we used the synteny genes as paralogs. The methods of sequence alignment and filtering criteria are the same as those in the previous section. The synonymous substitution rate of each paralog for each species were calculated using the KaKs-Calculator [46] with the NG (Nei-Gojobori) model.