De Novo Transcriptome Assembly (NGS) of Curcuma longa L. Rhizome Reveals Novel Transcripts Related to Anticancer and Antimalarial Terpenoids

Herbal remedies are increasingly being recognised in recent years as alternative medicine for a number of diseases including cancer. Curcuma longa L., commonly known as turmeric is used as a culinary spice in India and in many Asian countries has been attributed to lower incidences of gastrointestinal cancers. Curcumin, a secondary metabolite isolated from the rhizomes of this plant has been shown to have significant anticancer properties, in addition to antimalarial and antioxidant effects. We sequenced the transcriptome of the rhizome of the 3 varieties of Curcuma longa L. using Illumina reversible dye terminator sequencing followed by de novo transcriptome assembly. Multiple databases were used to obtain a comprehensive annotation and the transcripts were functionally classified using GO, KOG and PlantCyc. Special emphasis was given for annotating the secondary metabolite pathways and terpenoid biosynthesis pathways. We report for the first time, the presence of transcripts related to biosynthetic pathways of several anti-cancer compounds like taxol, curcumin, and vinblastine in addition to anti-malarial compounds like artemisinin and acridone alkaloids, emphasizing turmeric's importance as a highly potent phytochemical. Our data not only provides molecular signatures for several terpenoids but also a comprehensive molecular resource for facilitating deeper insights into the transcriptome of C. longa.


Introduction
Curcuma longa, commonly known as turmeric, is a rhizomatous small perennial plant from the ginger family (Zingiberaceae). Turmeric has been used as a colouring and flavouring additive in day to day cooking in India and many east asian countries for centuries and also used as a household remedy for many ailments. Turmeric at alkaline PH turns bright red and it is widely used as Vermilion, also known as kumkum, an important part of Hindu religious ceremonies. Over the last few decades, turmeric has gained global recognition for its medicinal importance after many studies that were conducted to understand its medicinal properties, yielded exciting results. The primary active constituent of turmeric is an important secondary metabolite namely, curcumin. It's role as an antimalarial [1], anti-inflammatory [2,3] and antitumor [3] compound has been well appreciated worldwide and it has also been known to modulate lipid metabolism, which has been implicated in obesity [4]. In addition, curcumin has also been used in clinical trials to treat Azheimer's [5].
Turmeric oil/oleoresin, extracts which contain curcuminoids and essential oils are used for flavouring and colouring. It has been shown that turmeric oleoresin has hypoglycemic [6], antiamyloidogenic [7], paracitidal [8], antimicrobial [9] and larvicidal [10] effects. Apart from curcumin there are many compounds in turmeric oil that contribute to its above mentioned properties. Nonetheless, global efforts in turmeric have been concentrated on studying and modifying curcuminoid biosythetic pathway and a thorough study of transcriptome is so far not attempted. Such studies might shed light into the functional genes and aid to understand the diverse pathways involved in phytomedical attributes of turmeric.
In recent years a number of non-model plants have been sequenced using various Next Generation Sequencing (NGS) platforms [11][12][13][14][15][16][17][18][19][20][21][22]. Despite turmeric's growing medicinal and economic importance, a comprehensive transcriptome level investigation is lacking. In this study an attempt has been made to analyse and annotate C. longa transcriptome from the rhizomes of three popularly cultivated cultivars in south India by assembling short paired-end Illumina reads. Cultivar Nattu (traditional) yeilds small rhizomes, cultivar Erode is widely grown commercial variety with larger rhizomes and cultivars Mysore requires higher irrigation with lower maturation time. Expression studies were conducted to observe differences across the three cultivars. The transcriptome will serve as an invaluable genomic reference to further our knowledge about turmeric at a molecular level.

Sequence Quality Control
A total of 20,519,88062 (72 base), 30,342,59862 (73 base), 37,193,40362 (100 base) raw reads were generated from Illumina GAIIx sequencer, accounting for approximately 2.9 Gb, 4.4 Gb and 7.4 Gb of sequence data, for cultivars Nattu, Erode and Mysore respectively. The raw paired-end sequence data in FASTQ format was deposited in the National Centre for Biotechnology Information's (NCBI) Short Read Archive (SRA) database under the accession number SRA052613. Raw reads were subjected to quality control (SeqQC). High quality (.Q20) bases were more than 90% in both the forward and the reverse (paired-end) reads (Table1). After removing the adapter and low quality sequences from the raw data, 34,924,986, 48,755,296 and 63,574,950 high quality reads were retained for cultivars A, B and C respectively. These high quality, processed paired-end reads were used for further analysis.

Transcriptome Assembly and Clustering
Filtered reads were assembled into contigs using Velvet at a hash length of 45, which generated 137,148, 91,995 and 203,400 contigs for cultivars Nattu, Erode and Mysore respectively. Further transcriptome assembly using Oases resulted in 56,770, 65,924 and 91,958 transcripts. Figure 1A shows the transcript length distribution ranging from 200 bases to more than 3000 bases. We pooled and further assembled the individual assemblies of the three cultivars to create a reference sequence for comparative analysis. Representative transcripts (RTs) obtained after clustering using CD-HIT contained 9,568, 13,679 and 38,300 transcripts from cultivars Nattu, Erode and Mysore respectively. Clustering resulted in 61,538 RTs.
The percentage of Ns in the assembly were found to be minimal: approximately 0.001% for cultivars Nattu and Erode 0.004% for cultivar Mysore and 0.002% for RTs. Total length of RTs was found to be approximately 56Mb and the mean transcript length was 910 bases (Table 2). RTs were observed to be marginally AT rich, with 57.37% AT content ( Figure 1B). The RTs can be accessed at TSA within the accession number range JW751789-JW813326.

BLAST Against C. longa Nucleotide Sequences and ESTs from ArREST
Sequence similarity search between RTs and GenBank's C. longa ESTs showed that 9,307 (15.1%) RTs were similar to 11,139 (86.8%) C. longa ESTs at an E-value cut-off of e-5 (,0.00001). Of these, 11,115 sequences matched with a sequence identity greater than 80% while the remaining sequences matched with an identity above 70%. A vast majority of the ESTs (5,372) were observed to align with more than 90% coverage ( Figure 2). This search also revealed the presence of curcumin synthase in the transcriptome (Additional file S1).
Sequence similarity search between RTs and ArREST ESTs revealed that 68,983 (87%) of the ArREST ESTs were identified in RTs. N50 value of RTs (1,515 bases) was significantly higher than that of ArREST ESTs (467 bases). We obtained a maximum contig length of 15,293 bases as opposed to 6,639 bases of ArREST ESTs. ArREST contained only 22,560 (28.7%) transcripts above 200 bases, which explains the lower mean transcript length of ArREST ESTs (273.76 bases) when compared to that of RTs (910.14 bases). Overall our dataset (RTs) represents transcripts with much longer sequences and better transcriptome coverage (Additional file S2).

Functional Annotation
We utilized six different databases in this study to annotate C. longa, as there were very limited reference information available. In toto, 33,614 (54.6%) RTs were annotated against all six databases. Around 33% (20,436) of RTs received Swiss-Prot annotation (Additional file S3). Out of these, 679 (3.3%) were annotated as Putative proteins while 2,285 (11.2%) were annotated as proteins with probable functions. This indicates that a large proportion of proteins received a definitive annotation from Swiss-Prot. GO terms were retrieved from Swiss-Prot annotation. About 33,172 (53.9%) of RTs were annotated with TrEMBL, an automatically annotated relatively larger database when compared to Swiss-Prot but poor in information, increases the chances of annotation with a previously known protein. Annotation against PlantCyc database resulted in the annotation of 8,789 (14.3%) RTs with enzymes from 255 pathways (Additional file S4). KOG and Genbank annotation resulted in the annotation of 19,383 (31.5%) and 3,311 (5.4%) RTs respectively. We also observed that around 39% (23,992) RTs received Pfam annotations. GO annotation showed that the annotated RTs represent genes with diverse functionalities and are involved in various metabolic pathways. We observed 26,638, 37,689, and 31,759 GO terms representing Cellular component, Molecular function and Biolog-ical process categories. In the cellular component category, the terms integral to membrane and nucleus were observed to occur most frequently, constituting 16.3% (4,353) and 14.8% (3,932) of total cellular component entries respectively. ATP binding and DNA  binding were found be the most frequently occurring under molecular function category, constituting 10.5% (3,951) and 4.9% (1,854). In the biological process category transcription,DNAdependent and regulation of transcription,DNA-dependent were observed most frequently constituting 6.9% (2,181) and 4.6% (1,476). Of transcripts with an assigned biological process term, response to stress, defense response and response to salt stress were also observed to occur more frequently, together constituting 3.9% (1,241) ( Figure 3). Since, the rhizome is buried in soil it is more prone to pathogen attacks and salt stress, hence it is expected to find defense and stress related terms in high numbers. Such higher occurrences of stress related categories, also indicates the possible presence of a large number of secondary metabolites. Each KOG cluster contains a protein or a group of proteins from at least 3 different eukaryotic lineages further clustered based on their function. All 25 KOG functional categories were represented in the annotation (Additional file S5). Annotation with KOG also revealed that 469 RTs, constituting 2.4% of total RTs annotated with KOG, fell in secondary metabolites biosynthesis, transport and catabolism category reflecting the vast repertoire of secondary metabolites present in the plant ( Figure 4).
For distantly related homologs, sequence similarity search may not yield significant information, hence transcripts were searched for the presence of conserved protein domains (Additional file S6). Protein kinases (PF00069 and PF07714) were found to be the most abundant domains and given their involvement in a variety of cellular processes, including metabolism, transcription, cell movement, differentiation and apoptosis, their abundance is expected. Annotation revealed the presence of Myb_DNA-binding domain,  Annotations from individual databases were used in interpreting findings and a final annotation table was obtained in order to arrive at a single best annotation for each transcript. The final annotation  (Table 3). TrEMBL initially had the highest share of annotations. However, in the final annotation table, a major share of the results was distributed among the well annotated databases (Swiss-Prot and KOG).

Mapping, Calling Variations and Quantifying Transcripts
The reads from cultivars Nattu, Erode and Mysore were aligned back to RTs. Alignments showed that 90.40% reads from cultivar Nattu, 91.24% reads from cultivar Erode and 93.07% reads from cultivar Mysore were aligned to the RTs (Table 4). Alignment file was processed by SAMtools for variant calling. In cultivar Nattu, 1,00,920 variations were classified as homozygous SNPs while  In this study, cultivar Nattu was used as control and the differential expression of transcripts in Erode and Mysore cultivars were determined. We observed 1,774 and 1,356 differentially expressed transcripts in Erode and Mysore cultivars respectively ( Figure 6). Of the 1,774 differentially expressed transcripts in cultivar Erode, we observed 629 upregulated transcripts and 1,145 downregulated transcripts. Similarly, in cultivar Mysore, we found 793 upregulated transcripts and 563 downregulated transcripts. We were not able to find significant expression level differences among the three cultivars at the transcriptome level. However, one significant observation in the expression analysis is that hypericin, a secondary metabolite with antitumor [23,24] and antibiotic properties [25], was found to be upregulated in both Erode and Mysore cultivars.

Discussion
Next Generation Sequencing technologies have revolutionized sequencing and sequencing is no longer laborious and costintensive. Transcriptome sequencing of non-model plants are gaining importance in recent years as it allows sequencing only the transcribed regions at low cost.
We obtained more than 90% HQ bases (Table 1) for all the cultivars which reflects a high quality sequencing run. Low quality bases as well as the presence of adapters in reads could interfere with the assembly process resulting in misassemblies or truncated contigs. Hence, low quality bases and adapter sequences were trimmed before proceeding with further analysis. Such trimmed HQ reads were used to arrive at a high quality assembly. N50 statistic is widely used to assess the quality of the assembly. Higher the N50 value better the assembly is. The observed N50 value (1,515 bases) was higher than those obtained in other plant transcriptome sequencing projects (Table 6) suggesting a better assembly. Nonetheless, a better assembly does not guarantee an accurate assembly because assembling plant sequences pose different challenges, as plants can have higher rates of heterozygosity and repeats [26]. However, with improvements in assembly algorithms [27], accurate assemblies will be made possible in future. Clustering of the three assemblies is expected to reduce the sequence redundancy arising either out of merging three assemblies or multiple isoforms inherent to an individual assembly or both. Clustering also enriches the information contained in the cluster by complementing from all three assemblies.
COG and GO annotations indicate the presence of transcripts involved in stress resistance and defence mechanisms. In this study we have included a pathway annotation even if only a single enzyme from a pathway is observed. For example, out of twenty different enzymes involved in perillyl alcohol biosynthesis we observed only (2)-limonene-3-hydroxylase enzyme once, yet there is phytochemical evidence for its presence in turmeric [28]. This justifies our decision to include all pathway annotations without setting any threshold. Sequence similarity search indicated the presence of curcumin synthase, an enzyme involved in the synthesis of curcumin, well known as a potent anti-cancer compound. It is known to act at various chronological stages, right from initial insults that cause DNA damage to metastasis by modulating a multitude of pathways. Effects of curcumin on cell cycle regulation, apoptosis, NF-kB and AP-1 transcription factors, autophagy, angiogenesis and metastasis have been well elucidated [3,[29][30][31][32][33][34][35][36][37].
It has been suggested that plants have developed terpene based host defence, which also represents a repertoire of therapeutic compounds [38]. Hence in this study, we have focused our analysis towards terpenoids. Major share of transcripts related to terpenoid pathways were found to be from menthol biosynthesis (25%). Recent evidences indicate that menthol has a potent anticancer property, effecting cell death through TRPM8 receptor [39]. Here in this study for the first time, we report the present of genes related to taxol biosynthetic pathway in turmeric. Taxol biosynthetic pathway was the third most represented (8.11%) terpenoid pathway in the transcriptome (Figure 7). Further studies into this pathway in turmeric could be benificial as the combination of taxol with curcumin has produced promising results in treating cancer [40].
We also observed transcripts related to pathways involved in synthesis of several other anti-cancer compounds including vincristine, vinblastine, matairesinol, hypericin, xanthohumol, simplecoumarins, geraniol and coumestrol [23,[41][42][43][44][45][46][47]. Epidemiological studies suggest that lower incidences of colon cancer in  India is due to consumption of diets rich in curcumin [48], but the presence of transcripts related to biosynthetic pathways of other anti-cancer molecules suggest that they along with curcumin might have elicited an effect in preventing cancer. However, further studies in this direction will help in validating this assertion. Pathway annotation of transcripts showed the presence of transcripts related to artemisinin and acridone alkaloids biosynthetic pathways. Both artemisinin and acridone alkaloids are proven to be effective in anti-malarial treatments [49][50]. Moreover curcumin per se has been reported to have anti-malarial activity as it is known to exhibit prooxidant properties in P. falciparum, at concentrations which are non-toxic to mammalian cells, inducing oxidative damage resulting in the death of parasite [51].
Our analysis revealed a number of transcripts of such sesquiterpenes as capsidiol, gossypol, phaseic acid, bergamotene, germacrene and farnesene when compared to monoterpenes linalool, geraniol, menthol and perillyl alcohol. It has been demonstrated that essential oils from leaves are usually dominated by monoterpenes while the oils from rhizomes mainly contains sesquiterpenes, which are synthesized in response to a pathogenic attack [28]. Our findings are consistent with this statement.
SNPs and Polymorphic SSR markers play an important role in genetic diversity analysis. The influence of SSRs on gene regulation, transcription and protein function typically depends on the number of the repeating units [52]. In the present study, trinucleotide type SSR motifs occurred more frequently which is consistent with findings from other studies involving monocots, like rice, barley and wheat [53].
The de novo transcriptome of this very important phytochemical herb brings out for the first time novel transcripts related to anticancer, antimalarial and anti-oxidant properties. Proper validation of the results at biochemical, cellular and animal model studies will certainly highlight more useful properties of turmeric in traditional and alternative medicine. The data may also aid plant breeders to engineer cultivars with enhanced terpenoid profiles.

Sample Collection and Preparation
Rhizome samples of the following three widely grown cultivars in South India were chosen for our study, cultivar Nattu, cultivar Erode and cultivar Mysore. RNA was extracted from the rhizome

Sequencing and Quality Control
Illumina GAIIx was used to generate 72 (Cultivar Nattu), 73 (Cultivar Erode) and 100 (Cultivar Mysore) base paired-end short reads using Sequencing By Synthesis (SBS). Standard Illumina pipeline (RTA-CASAVA-OLB) was used to generate short reads in FASTQ format. Accuracy of base calling is reflected in the quality scores and low quality scores usually denote high error probabilities. Low quality bases, if due to errors, will interfere in the assembly process either resulting in misassemblies by collapsing repeat regions or truncated contigs by obscuring true overlaps [26]. Hence, quality filtering is very essential in order to arrive at a high quality assembly. Hence additional quality control was performed using in-house program (SeqQC V2.1 -http:// genotypic.co.in/SeqQC.html) to generate high quality reads for use in assembly. The reads were filtered or trimmed for adapters, B trimming (CASAVA1.7 User Guide) and other low quality bases using in-house Perl scripts. These high quality, filtered reads were used for further analysis.

Transcriptome Assembly and Clustering
Contig assembly for all the three cultivars was carried out using a de Bruijn graph based de novo genome assembler Velvet-1.1.07 (http://www.ebi.ac.uk/˜zerbino/velvet/) [54]. Velvet takes in short reads and assembles them into contigs using paired-end information. A draft assembly was built with following parameters: hash length = 45, expected coverage = auto and coverage cutoff = auto. This draft assembly was used by observed-insert-length.pl and estimate-exp_cov.pl (from Velvet package) to estimate insert length and expected coverage parameters, which were then used to generate a final assembly. The values of the estimated insert length, insert length standard deviation and expected coverage for the three draft assemblies are as follows, Cultivar Nattu: 153, 53.17, 5; Cultivar Erode: 151, 49.67, 5 and Cultivar Mysore: 150, 44.55, 3. The resulting contigs were assembled into transcripts by Oases-0.2.01 (http://www.ebi.ac.uk/˜zerbino/oases/) [55], which uses the assembly from Velvet and clusters them into small groups (loci). It then uses paired end information to construct transcript isoforms. Assembly statistics were calculated using in-house Perl scripts.
The transcripts from three individual assemblies were clustered (CD-HIT v4.5.4 http://www.bioinformatics.org/cd-hit/) [56] in order to generate a comprehensive reference. Sequence identity threshold and alignment coverage (for the shorter sequence) were both set as 80% to generate clusters. Such clustered transcripts were defined as reference transcripts in this work.

Functional Annotation
Database. No single database could be used to comprehensively annotate the transcripts. Hence using multiple databases for annotation could help in rich annotation of the transcripts and thereby providing insights into the function. In this study we have used six databases to derive annotation, which include Viridiplantae mRNA dataset from NCBI's GenBank (3,184 [57], PlantCyc Enzymes database (v2.0) [58], KOG proteins from Clusters of Orthologous Groups (COG) database (112,920 sequences) [59] and Pfam (v26.0) [60].
To make a final annotation table, a transcript's best annotation was chosen based on the BLAST scores [61]. Swiss-Prot, PlantCyc and KOG databases were given preferences and if a transcript does not have annotation in these databases then either GenBank Viridiplantae mRNA or TrEMBL annotation was chosen based on the blast scores and if a transcript does not have annotations in any of the above databases, Pfam annotation is assigned to the transcript.

BLAST against Curcuma Longa Nucleotide Sequences and ESTs from ArREST
As of 19 th April 2012, 12,833 Curcuma longa sequences (240 nucleotides and 12,593 ESTs) were available NCBI and were downloaded for BLAST search against representative transcripts at an E-value threshold of e-5 (,0.00001). Although, PlantCyc database (v2.0) did not contain curcumin biosynthetic pathway, all 3 isoforms of curcumin synthase mRNA (AB495007.1, AB506762.1 and AB506763.1) were available in NCBI. Hence, this BLAST search was also used to confirm the presence of curcumin synthase.
A total of 78,516 C. longa ESTs were downloaded from Aromatic Rhizome EST (ArREST) database http://www. plantrhizome.org/download/. All of these sequences were used for BLAST search against RT using an E-value threshold of e-5 (,0.00001).

Mapping, Calling Variations and Quantifying Transcripts
High quality filtered reads from each Cultivar were individually aligned (bowtie2 v2.0.0-beta5 http://bowtie-bio.sourceforge.net/ bowtie2/index.shtml) [63] to the RTs. End-to-end alignment allowing non-discordant read alignment was performed, with insert size of 900 bases, and only the best alignment were reported. The alignment was generated in Sequence Alignment/Map format. The alignments were processed for further analysis like variant calling using SAMtools v0.1.7a (http://samtools. sourceforge.net/) [64]. A combination of reads showing variation and read depth, along with mapping quality and SNP quality were considered for filtering the SNPs (Additional file S11).
Differential expression analysis was performed by employing a negative binomial distribution model (DESeq v1.8.1 package http://www-huber.embl.de/users/anders/DESeq/) [65]. Dispersion values were estimated with the following parameters: method = blind, sharingMode = fit-only and fittype = local. Cultivar Nattu was considered as a control to compare against other two cultivars (Erode and Mysore). P value threshold of 0.01 was used to filter statistically significant results.

Identification of SSRs
SSRs were detected using MIcroSAtellite Identification Tool (MISA v1.0). Minimum unit size cut-off of 6 was used to report a dinucleotide repeat, 4 for a trinucleotide repeat and 3 for SSRs of sizes 4, 5 and 6. A maximum distance of 100 nucleotides was allowed between two SSRs.

Supporting Information
File S1 BLAST against Curcuma longa nucleotide sequences. The table lists the best hit for  File S10 SSR information Cultivar C. This file provides SSR information for sample C (XLS) File S11 SNP filtering criteria. The file provides criteria used for filtering SNPs. (DOC)