Transcriptome Sequence Analysis of an Ornamental Plant, Ananas comosus var. bracteatus, Revealed the Potential Unigenes Involved in Terpenoid and Phenylpropanoid Biosynthesis

Background Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant for its colorful leaves and decorative red fruits. Because of its complex genome, it is difficult to understand the molecular mechanisms involved in the growth and development. Thus high-throughput transcriptome sequencing of Ananas comosus var. bracteatus is necessary to generate large quantities of transcript sequences for the purpose of gene discovery and functional genomic studies. Results The Ananas comosus var. bracteatus transcriptome was sequenced by the Illumina paired-end sequencing technology. We obtained a total of 23.5 million high quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. In total 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 unigenes were annotated in the NCBI non-redundant protein database and 23,134 unigenes were annotated in the Swiss-Port database. Out of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes which were mapped to 117 pathways. The assembly predicted many unigenes that were previously unknown. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis, and sorghum. Unigenes that did not match any of those five sequence datasets are considered to be Ananas comosus var. bracteatus unique. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. Conclusion The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge; this is the first report on the de novo transcriptome sequencing of the Ananas comosus var. bracteatus. Unigenes obtained in this study, may help improve future gene expression, genetic and genomics studies in Ananas comosus var. bracteatus.

Introduction expression microarray results have produced much important information about how the transcriptome is deployed in different cell types [14] and tissues [15]. In 2012, first time microarray based gene expression study undertaken in pineapple [16]. This study identified a number of genes, processes and pathways with putative involvement in the pineapple fruit ripening process [16]. Following this sequencing, Ong et al. generated 4.7 million paired-end high quality reads of ripe yellow A. comosus var. comosus fruit using Illumina technology [17]. The assembly produced 28,728 unique transcripts with average length of approximately 200 bp. A total 16,932 unique transcripts were identified against non-redundant NCBI database. Of these 15,507 unique transcripts were assigned to gene ontology terms and 13,598 unique transcripts were mapped to 126 pathways in the genomes pathway database (Kyoto Encyclopedia of Genes and Genomes; http://www.genome.jp/kegg/). To date, however, the A. comosus var. bracteatus genome has not been fully sequenced to understand the underlying functional mechanisms and its encoded genes. As October 2014, only 110 nucleotide sequences, 0 ESTs, 3 protein and 0 genes from A. comosus var. bracteatus had been deposited in the NCBI's GenBank database.
In the present study, a de novo transcriptome sequencing for A. comosus var. bracteatus using the Illumina sequencing was initiated. Leaf, root and stem samples of A. comosus var. bracteatus were sequenced and a total of 23,584,613 (23.5 million) reads and 41,052 unigenes were identified. However, short read length limits de novo contig assembly efficiency. Discussions on sequencing bias of high-throughput technologies have taken place in several publications [18][19][20][21]. To our knowledge, this is the first transcriptome characterization of A. comosus var. bracteatus. In addition, we compared transcriptome sequence similarity with pineapple (Ananas sp.), maize (Zea mays), sorghum (Sorghum bicolor), rice (Oryza sativa), Arabidopsis (Arabidopsis thaliana). This information provides genes and gene network involved in the significant biological and economical characters of A. comosus var. bracteatus.

Results and Discussion
Sequencing and de novo transcriptome assembly The cDNA library for A. comosus var. bracteatus was prepared and sequenced using the Illumina Genome analyzer. In order to accurately analyze the data, we filter the raw data to ensure the quality of each reads the value of less than 20 nucleotides does not exceed 20%, N content of not more than 5% removal of ribosomal RNA. After data filtering, 23,584,613 (23.5 million) reads with 98.64% Q20 bases were selected as high quality reads for further analysis. After the removal of adaptor sequences and exclusion of contaminated or short reads, 23.5 million high quality read sequences were assembled into 1,134,553,18 contigs using SOAPdenovo [22]. Using the Trinity de novo assembly program, short-reads were assembled into 1,186,576,93 transcripts with a mean length of 1224.83 bp (Fig. 1A). The transcripts were subjected to cluster and assembly analyses. A total 41,052 unigenes were obtained, with an average length of 837.62 bp, among which 10,923 genes (26.61%) were greater than 1 kb (Fig. 1B). These results demonstrated the effectiveness of Illumina sequencing technology in rapidly capturing a large portion of the transcriptome.
The length distributions of contigs, transcripts and unigenes are shown in Table 1. The N50 values of transcripts and unigenes were 1869 and 1520 bp, respectively. As expected for a randomly fragmented transcriptome, there was a positive relationship between the length of a given unigene and the number of reads (Fig. 1C). To facilitate the access and use of A. comosus var. bracteatus, transcriptome sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) database with accession number SRX681749.

Sequence annotation
In order to make an functional annotation and classification for the putative identities of the assembly, all the unigenes were annotated by aligning with deposited sequences in diverse protein databases including the National center for Biotechnology (NCBI), nonredundant protein (Nr) database, NCBI non-redundant nucleotide sequence (Nt) database, UniProt/Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), Cluster of Orthologous Groups of proteins (COG) and UniProt/TrEMBL. The best one was selected from the matches with an E-value of less than 10 -5 . Only 24,749 (60.2%) unigenes were annotated with a threshold of 10 -5 by performing a BLASTx using above diverse protein databases (Nr, Nt, Swiss-Port, COG, KEGG and TrEMBL). According to the BLASTx results, 23,275 (56.6%) unigenes have homologues proteins in the Nr protein database, and 19, 817 (48.2%) had significant matches in Nt database and while 17,579 (42.8%) unigenes had similarity to proteins in the Swiss-Port database are listed in S1 Table. The overall functional annotation is listed in Table 2. Out of 41,052 unigenes, only 24,749 (60.2%) unigenes matched to known genes while 16,303 (39.7%) unigenes were unmapped in those databases, which could be attributable to the short reads generated by the sequencing technology, most of which probably lack the conserved functional domains or insufficient sequences of pineapple in the public databases.
Based on the above similarity search, we conducted species distribution of pineapple unigenes based on the BLASTx analysis, it showed that the unigenes hit a range of plant species. Among the various plants that have protein sequences in GenBank, the A. comosus var. bracteatus unigenes had the highest number of homologous to Oryza sativa Japonica group (13%), Setaria italica (11%) and Vitis vinifera (11%). This was followed by Zea mays (6%), Brachypodium distachyon (6%), Sorghum bicolor (6%), Oryza sativa Indica group (5%), Theobroma cacao (4%). While less than 4% homologues to other plants and others (20%) were summarized in the Fig. 2. Interestingly, A. comosus var. comosus fruit transcriptome also had the highest similarities (26.2%) with O. sativa [17]. The high similarities of the A. comosus var. comosus [17] and A. comosus var. bracteatus (this study) unigenes to the O. sativa genes may provide for the possibility of using the rice EST's as a reference for future assembly using next generation sequencing.
In addition, functions of the assembled transcripts were classified using Gene Ontology (GO). A total of 17,746 unigenes were divided into three ontologies: cellular component, molecular function and biological processes. Of these, majority of the GO terms were assigned to biological processes (47%), followed by cellular component (27%) and molecular function (26%) (Fig. 3). The biological function category mainly comprised proteins involved in metabolic, cellular processes and response to stimuli, biological regulation and localization and biogenesis are most represented functions. Of these, genes involved in the metabolic and cellular processes were highly represented. In cellular components, the major classifications of these genes products were cell, cell part, organelle and membrane. For molecular functions, the genes involved in catalytic and binding activities were both highly represented. In comparison to pineapple fruit GO annotation [17], more number of the genes were expressed in all the three ontologies (cellular component, molecular function, biological function). GO annotation provided a general gene expression profile picture for A. comosus var. bracteatus, which showed that the sequenced genes were responsible for fundamental, biological regulation and metabolism. Based on the Nr annotation, all unigenes were subjected to a search against the COG database for functional prediction and classification. In total, 8,505 unigenes with hits in Nr database could be assigned to COG classification and divided in to 25 specific categories, listed in Fig. 4. The cluster for general function prediction (17.82%) represented the largest group, followed by translation, ribosomal structure and biogenesis (10.87%), replication, recombination and repair (8.54%) and transcription (8.33%), posttranslational modification, protein turnover, chaperones (7.05%) and signal transduction mechanisms (6.94%). Approximately 5% unigenes hit with energy production and conversion, carbohydrate transport and metabolism and amino acid transport and metabolism. However, unigenes involved in secondary metabolite biosynthesis were found in the categories of energy production and conversion and secondary metabolites biosynthesis, transport and catabolism respectively.
Comparative analysis with pineapple, rice, maize, sorghum and Arabidopsis In our transcriptome, we examined A. comosus var. bracteatus unigenes with five species namely pineapple, maize, sorghum, rice and Arabidopsis available genome data. To study the distribution of conversation between A. comosus var. bracteatus unigenes and pineapple EST sequences obtained from NCBI database (until October 2014) based on tBLASTx method. For this analysis, the E-values of tBLASTx similarity searches were used an estimate of sequence conservation. The detailed list of tBLASTx scores are listed in the S3 Table. Based on the above distribution of conservation search, we simultaneously examined more closely the putative functional role between A. comosus var. bracteatus unigenes and available pineapple EST sequences (Fig. 5).
To study the sequence conservation of A. comosus var. bracteatus unigenes in other plant species, a BLASTX search was performed against the four available proteomes of completely sequenced plant genomes. The largest number of A. comosus var. bracteatus unigenes exhibited 22,194 significant similarity hits to rice transcript, 21,886 unigenes had similarity hits to maize, 22,183 unigenes had similarity hits to sorghum and 21,537 unigenes had similarity hits to Arabidopsis (Fig. 6A). Based on the above similarity search, we then conducted GO analysis to compare the functional classification between A. comosus var. bracteatus and the above five species (Fig. 6B). The detailed results were listed in S4 Table. A total of 20,716 shared homologues among these four species. Of these, genes involved in metabolic and cellular processes are highly represented in biological processes. Under the molecular function ontology, proteins involved in binding and catalytic activity was dominant while in cellular component ontology, proteins for cell part, cell and organelle development were generally encoded by A. comosus var. bracteatus unigenes. Therefore, we conclude that the differences in the expression profile and functional allocation of the A. comosus var. bracteatus unigenes with sequence similarity hits to four species concurrently contribute to the divergence of A. comosus var. bracteatus from other crops.

Terpeniod backbone biosynthesis
Terpenoids, are the largest group of natural products, extensively used for their aromatic qualities and for the treatment of human diseases [23]. All terpenoids are synthesized through two independent pathways: the cytosolic mevalonic acid (MVA) pathway and the plastidialmethyerythriol phosphate (MEP) pathway (Fig. 7). In our transcriptome, we identified 43 unigenes, which encode all enzymes involved in the terpeniod backbone biosynthesis.
Next, conversion of isopentenyl diphosphate or dimethylallyl diphosphate to geranyl diphosphate (geranyl-PP) was carried out by dimethylallyltransferase (geranyl geranyl diphosphate synthase (GPS), EC 2.5.1.1, 3 unigenes) for the synthesis of monoterpens. The enzymes, farnesyldiphospahate synthase (FDPS, EC 2.5.1.10, 2 unigenes) and geranyl geranyl diphosphate synthase (GGPS, EC 2.5.1.29, 5 unigenes), are involved in the synthesis of farnesyl diphospahte (farnesyl-PP) and geranyl geranyl diphosphate (geranyl geranyl-PP) respectively. The enzymes which are responsible for synthesis of sesquiterpenes, caretenoid, steroid, porphyrin and chlorophyll metabolism, ubiquinone and other terpenoid-quinone from farnesyl-PP and geranyl geranyl-PP were identified in A. comosus var. bracteatus sequence assembly (Fig. 7). Transcriptome sequencing studies of Ong et al. [17] and our dataset results indicate the clear picture of terpenoid backbone synthesis in the genus Ananas. The EC number and number of unigenes encoding the enzymes involved in the terpenoid backbone biosynthesis are listed in the S5 Table. Phenylpropanoid biosynthesis The phenylpropanoid pathway serves as a rich source of metabolites in plants, being required for biosynthesis of lignin, flavonoids, coumarins and hydroxycinnamic acids [24,25]. In our annotated A. comosus var. bracteatus transcriptome datasets, we identified multiple unigenes encoding enzymes involved in known lignin and flavonoid pathways. Because of pharmacological effects of A. comosus var. bracteatus lignin and flavonoids, we focused on the related genes in this study.

Conclusions
In recent years transcriptome sequencing has become an effective tool to discover the unigenes at a given moment in time. The present work presents the first de novo transcriptome sequencing analysis using RNA-seq technology to profile the A. comosus var. bracteatus transcriptome on the Illumina platform. A total, 23,584,613 high-quality reads were obtained from the platform, and 41,052 unigenes were identified by de novo assembly. Among them, 24,749 (60.2%) unigenes were annotated indicating their relatively conserved functions. Functions of the unigenes were classified by COG, GO and found potentially involved in biological, cellular and molecular processes. In addition, A. comosus var. bracteatus genes related to terpenoid and phenylpropanoid pathways were characterized and their sequences compared to the sequence databases of pineapple, maize, sorghum, rice and Arabidopsis. To our knowledge, this is the first report on the characterization of the transcriptome of A. comosus var. bracteatus and assembly of the high quality reads was conducted without a reference genome. This work will provide a sequence source and facilitate the studies in molecular mechanism of terpenoid,  lignin and flavonoid biosynthesis in A. comosus var. bracteatus. Overall, our data will facilitate research on functional studies and serves as reference for other Ananas family members.

Plant materials and RNA extraction
A. comosus var. bracteatus was obtained from Sichuan Agricultural University garden, Sichuan Province. The plant is 2-year old which is generated from the sucker of the mother plant. The mother plants were obtained from the garden in Zhanjiang, Guangdong Province (coordinates 21°12 0 N 110°24 0 E), China [sampling place did not involve endangered or protected species and require specific permission for these locations]. Leaves with white margin and red color, roots, stem were dissected and total RNA extracted using the RNeasy plus Micro Kit (Qiagen, Hilden, Germany). The quality and quantity of total RNA was assessed at an absorbance ratio (OD 260/ 280 and OD 260/230 ) and 1% agarose gel electrophoresis. Equal amounts of RNA from each sample were mixed into a single large pool for cDNA synthesis.

cDNA library construction and sequencing
To obtain cDNA library, first stand synthesis cDNA was synthesized using random hexamer primers from purified Poly (A) mRNA. Second strand cDNA was synthesized using DNA polymerase I and RNase H. Short fragments were then separated by agarose gel electrophoresis and selected for PCR amplification as sequencing templates. Finally, cDNA library was sequenced on Illumina HiSeq 2000 sequencing platform, Beijing Biomarker Technologies, Beijing, China.

Sequence analysis and de novo assembly
The raw reads were cleaned by removing adapter sequences, low quality sequences (reads with ambiguous bases 'N'), and reads with >10% Q-value<20 bases. All sequences smaller than 60 bases were eliminated based on the assumption that small reads might represent sequencing artifacts [31]. The high quality reads were assembled into contigs using the Trinity method (http://trinityrnaseq.sourceforge.net/), which recovers more full-length transcripts across a broad range of expression levels and sequencing depths [32]. Subsequently, the contigs were linked into transcripts according to the paired-end information of the sequences. Based on nucleotide sequence identity, longest transcripts were considered as unigenes for functional annotation. Sequences with high nucleotide identity were screened and listed.

Functional annotation
The assembled sequences were compared against the NCBI Nr, Nt and Swiss-port databases with an E-value < 10 -5 [33]. The open reading frames (ORFs) were identified as the nucleotide sequence or as the protein translation using "GetORF" program of EMBOSS software package [34].We quantified transcript levels in reads per kilobase of exon model per million mapped reads (RPKM) [35]. The RPKM method is able to reflect the molar concentration of a transcript by normalizing for RNA length for the total read number in the measurement.
To annotate the assembled sequences with GO terms, the Swiss-Port BLAST results were imported into BLAST2GO, a software package that retrieves GO terms, allowing gene functions to be determined and compared [36]. The unigenes sequences were also aligned with COG database to predict and classify functions. Then KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways were assigned to assembled sequences using the online KEGG Automatic Annotation Server (KAAS), http://www.genome.jp/kegg/kaas/. All the mapped sequences were annotated to the KEGG database to obtain the enzyme commission (EC) number. The EC were then mapped to the KEGG pathway to obtain the KEGG Pathway-Maps.
Supporting Information S1