De Novo Transcriptome Sequencing Reveals Important Molecular Networks and Metabolic Pathways of the Plant, Chlorophytum borivilianum

Chlorophytum borivilianum, an endangered medicinal plant species is highly recognized for its aphrodisiac properties provided by saponins present in the plant. The transcriptome information of this species is limited and only few hundred expressed sequence tags (ESTs) are available in the public databases. To gain molecular insight of this plant, high throughput transcriptome sequencing of leaf RNA was carried out using Illumina's HiSeq 2000 sequencing platform. A total of 22,161,444 single end reads were retrieved after quality filtering. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of transcriptome. A total of 101,141 assembled transcripts were obtained, with coverage size of 22.42 Mb and average length of 221 bp. Guanine-cytosine (GC) content was found to be 44%. Bioinformatics analysis, using non-redundant proteins, gene ontology (GO), enzyme commission (EC) and kyoto encyclopedia of genes and genomes (KEGG) databases, extracted all the known enzymes involved in saponin and flavonoid biosynthesis. Few genes of the alkaloid biosynthesis, along with anticancer and plant defense genes, were also discovered. Additionally, several cytochrome P450 (CYP450) and glycosyltransferase unique sequences were also found. We identified simple sequence repeat motifs in transcripts with an abundance of di-nucleotide simple sequence repeat (SSR; 43.1%) markers. Large scale expression profiling through Reads per Kilobase per Million mapped reads (RPKM) showed major genes involved in different metabolic pathways of the plant. Genes, expressed sequence tags (ESTs) and unique sequences from this study provide an important resource for the scientific community, interested in the molecular genetics and functional genomics of C. borivilianum.


Introduction
Chlorophytum borivilianum is an important species of liliaceae family due to its exceptional medicinal properties. Overexploitation and extensive harvesting of the wild strands has threatened its status as 'endangered' species by International Union for Conservation of Nature and Natural Resources (IUCN) [1]. Due to its high medicinal properties, the species has been recognized as 26 th among the top priority medicinal plants to be protected and promoted by the Medicinal Plant Board, Government of India.
Chlorophytum borivilianum brag the exuberant references in many Ayurvedic classics like Charka Samhita (2 nd century B.C.), Sushrut Samhita (2 nd century A.D.), Raja Nighantu (17 th century A.D.) etc. (http://www.safedmusli.net). Its tubers are used for aphrodisiac, adaptogen, antiageing, health restorative and health promoting purposes. The amalgamation of C. borivilianum leaves with other herbs such as Withania sominifera, Emblica officinalis etc. makes the body resistant against sex related diseases and also delays menopause [2]. The above attributes have made C. borivilianum an essential ingredient in Ayurvedic, Unani and Allopathic formulations. Major phytochemical components reported from the roots of C. borivilianum include steroidal saponins, fructans and fructoligosaccharides (FOS), acetylated mannans, phenolic compounds and proteins [3,4]. Steroidal saponins, are considered to be the principal bioactive components responsible for the pharmacological properties [5] and borivilianosides, furostane type steroidal saponins, have been isolated and characterized from this plant [6,7].
Steroidal saponins are synthesized via the mevalonic acid (MVA) pathway, pervasively operating in cytoplasm [8], or through the newly discovered non-mevalonate pathway (MEP) located in plastids [9,10]. Cyclization of precursor compound, 2, 3-oxidosqualene, involving oxidosqualene cyclase (OSC) combined with modifications on steroid skeletons like hydroxylations and glycosylations lead to the formation of various saponins. Several OSC genes like cycloartenol synthase (CAS), lupeol synthase (LS), bamyrin synthase (b-AS) have been cloned from various plant systems [11,12]. According to the proposed pathway [13], some specific CYP450s and UDP-glycosyltransferases (UGTs) may catalyze the conversion of cycloartenol to various steroidal saponins. Little is known about the molecular mechanism of the biosynthetic pathway downstream of cyclization process involved in saponin biosynthesis. The functional genomics studies in C. borivilianum is still in its infancy and few expressed sequence tags (ESTs) have been generated by our group [13], with an aim to identify the differentially expressed genes in the leaf and root tissues of C. borivilianum. Also, efforts are being made to clone the genes of interest via the candidate gene approach [14] for this plant.
ESTs play significant role in accelerating gene discovery, especially in non model organisms where reference genome sequence is unavailable, as it is a rapid and relatively economical method for analyzing the transcribed region of the genome [15]. Besides this, ESTs are also helpful in large scale expression analysis, improving genome annotation, identifying splice variants, molecular markers identification and physical mapping [16]. For large scale transcriptome analysis, next generation sequencing has evolved to be a very useful technique for providing large expression data in much shorter time period, depth and coverage to expedite understanding of metabolic pathway as well as contribute to comparative transcriptomics, evolutionary genomics and gene discovery [17,18]. Illumina high throughput sequencing technology yields huge amount of parallel sequence short reads with larger coverage. Assembling these short read sequences is a challenging task in the absence of any reference sequence. However, many de novo assembly tools have been developed that can be used to analyze the short read sequences [19,20].
We have undertaken the first global analysis of C. borivilianum transcriptome. Strategy has been developed for de novo assembly of transcriptome using short-read sequence data generated by Illumina RNA-Seq method in lieu to identify candidate genes involved in saponin biosynthetic pathway. Furthermore, GC content analysis, identification of EST-SSRs and gene expression analysis has also been done. Transcriptome coverage, at 22.42 megabase pairs, was comprehensive enough to discover all known genes of several major metabolic pathways. The data has been submitted to the Sequence Read Archive (SRA) of NCBI database under the accession ID PRJNA 196968 and will serve as a public information platform for further studies in C. borivilianum.

Plant Material, growth conditions and saponins extraction
Plants of C. borivilianum were grown under controlled conditions in the growth chamber with a day/night temperature of 2761uC, for 16 h photoperiod (flux density of 200 mmol m 22 s 21 ) with 30% relative humidity at Panjab University, Chandigarh (India). Plants were regenerated from the previous year's roots in the month of April. Young leaves and roots of a two month old plant were harvested during the period of active growth (June, average outside temperature 38uC), snap frozen in liquid nitrogen and later transferred to a 280uC freezer until further processing.
Total saponins were isolated from leaf and root tissues by soxhelation method and analyzed by thin layer chromatography (TLC) [21]. Briefly, powdered tissues were defatted with petroleum ether (60-80uC). Ethanolic extracts were prepared by extracting defatted tissues with 95% ethanol in a soxhlet extractor. Ethanolic extract was suspended in water (100 ml) and then extracted with n-butanol (300 ml). The volume of n-butanol soluble portion was reduced to half under reduced pressure and finally saponins were precipitated by addition of diethyl ether. Chromatographic analysis of ethanolic extract was performed on silica gel-GF254 precoated plates using chloroform: glacial acetic acid: methanol: water (16:8:3:2, v/v) as mobile phase. Anisaldehyde (0.5 ml) mixed with glacial acetic acid (10 ml), methanol (85 ml) and concentrated sulphuric acid (5 ml) was used as spraying reagent. TLC plates, after spraying the reagent, were heated at 100uC and saponin spots were visualized.

RNA isolation, cDNA library construction and Illumina sequencing
Total RNA was extracted by combining the methods as described by Ghawana et al (2011), [22] followed by RNA purification and on column DNaseI digestion using miRNA Easy kit (Qiagen, Germany). The integrity of RNA samples was assessed by Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Sample with RNA integrity number (RIN) value more than 8.0 was selected for further use.
Library construction and sequencing was performed at Microarray core facility, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah, USA. cDNA library was generated using TruSeq TM RNA Sample preparation kit (Illumina, USA) according to the manufacturer's instructions. Briefly, oligo(dT) beads were used to purify poly(A) mRNA from total RNA. mRNA was fragmented using RNA fragmentation kit (Ambion, USA). First strand cDNA was synthesized from the fragmented mRNA using random hexamer primer and reverse transcriptase (Invitrogen, USA). Single-end cDNA library was prepared in accordance with Illumina's protocol with an insert size of 150 bp. 50-cycled single end library sequencing was performed using Illumina Hiseq2000 platform. Raw reads were filtered to obtain high-quality clean reads by removing adaptor sequences, duplication sequences using FastQC software and bases with Phred score ,20 were trimmed. Based on the quality check for each base pair in the reads, last two base pairs from each read were removed in order to minimize the sequencing error, which is usually higher in the 39 end of reads.

De novo assembly of the sequences and clustering
All the assemblies were performed on Red Hat based SGI workstation with 48 cores, 2.27 GHz Intel Xeon processor and 50 Gb random access memory. We used SOAPdenovo (version 1.04; http://soap.genomics.org.cn/soapdenovo.html) which applies de Bruijn graph algorithm for de novo assembly of high quality (HQ) sequence reads to generate a non-redundant set of transcripts. Clean reads were first split into different 'k-mers' for assembly, in order to produce contigs, using the de Bruijn graph. K-mer size of 23 achieved best balance between the number of contigs produced, coverage and average sequence length obtained.
In order to reduce sequence redundancy, clustering process was supplemented with TIGR gene indices clustering tools (TGICL), which is based on pairwise sequence similarity and then assembly by individual clusters in order to retrieve better results [23]. Clusters were then passed to contig assembly program (CAP3) assembler for multiple alignments and consensus building. TGICL and CAP3 assembly were run under default parameters. Resulting singletons and consensus contigs were merged for final assembled transcripts.

Functional annotation and Classification
All the assembled unigenes (consensus and singletons) longer than 100 bp were annotated by assigning putative gene descriptions and Gene Ontology (GO) terms on the basis of sequence similarity with previously identified genes annotated with similar details. Unigenes were subjected to BLASTX search against nonredundant protein database and significant hits with e-value #1.0e 205 were extracted. Transcripts that did not show any significant hit were searched using BLASTN tool against the nonredundant database. Functional categorization by GO terms (http://www.geneontology.org) [24] and Enzyme Commission (EC) database was carried out based on Anno8r tool [http://www. nematodes.org/bioinformatics/annot8r/] with E-value threshold of 1.0e 205 . GOSlim terms for molecular function, biological process, and cellular component categories associated with the best BLASTX hit with non-redundant database were assigned to the corresponding C. borivilianum transcripts. KOBAS 2.0 (http:// kobas.cbi.pku.edu.cn/home.do) software was used for annotation of transcripts with KEGG pathways. For this, we first used KEGG orthology data provided at website and locally BLAST all transcripts with it. The transcripts showing similarity with KO database at e-value cut-off of 1e 205 were selected. The resultant data was used as an input for annotations wherein, we mapped transcript ID's to KO terms which finally annotate the transcripts with KEGG pathways. The 'Identify' program in KOBAS 2.0 uses annotation to identify enriched pathways. Oryza sativa, being a monocot, was used as background species. Arabidopsis thaliana, being a model organism was also used as background species. Hyper geometric distribution was used for p-value calculation. The false discovery rate (FDR) was calculated by using q-value and other parameters were used as default.

Read mapping onto C. borivilianum transcripts
The expression level of each assembled transcript was measured through reads per kilobase per million mapped reads (RPKM) values. All reads were mapped onto the non-redundant set of transcripts to quantify the abundance of assembled transcripts. SeqMap [25] was used for read mapping and rSeq [26] was then applied for RPKM based expression measurement. Assembled sequences were used as reference sequence to map back short reads and to measure RPKM for all assembled transcripts as suggested by Mortazavi et al. [26] and Jiang and Wong [25]. For RPKM measurement, filtered reads were first mapped back to various assembled transcripts and total mapped reads were estimated. Unique mapped reads were assigned to each assembled transcript allowing maximum two mismatches. For each such cluster having similar sequences in database, longest sequence was considered as the representative sequence for the unique gene it represented. The associated GO terms and Id's were parsed for each of such sequence and their corresponding RPKM values for different assembled genes were calculated.

GC content analysis and Simple Sequence Repeats (SSRs) identification
GC content analysis was done using in-house developed R script. A Perl script known as MIcroSAtellite (MISA, http://pgrc. ipk-gatersleben.de/misa/) was used to identify SSRs in the unigenes. Repeats of mono-nucleotide more than 10 times, dinucleotides repeats more than 6 times, tri-, tetra-, penta-and hexanucleotide repeats more than 5 times were considered as search criteria in MISA script.

Identification of Transcription Factor families
For identification of transcription factor families represented in C. borivilianum transcriptome, transcripts were probed against all the transcription factor protein sequences at Plant transcription factor database (PlnTFDB; http://plntfdb.bio.uni-potsdam.de/v3. 0/downloads.php) using BLASTX with an e-value cut-off of 1.0e 205 .

Transcriptome sequencing and de novo assembly
Illumina Hiseq 2000 (Illumina, USA) run representing the cDNA library from leaf tissues produced 22,595,634 single end (SE) reads. Each read was 50 bp in length encompassing nearly 4.0 Gb of sequencing data in fastq format. Since 39 ends of reads are more prone to sequencing error, so after quality check for every 50 bp read, only 48 bases (excluding 2 bases at 39 end) were considered for further analysis. Sequence data was filtered for lowquality reads and reads containing primer/adaptor sequences. Reads having the Phred quality score $20 (an error probability of 0.01) were selected for further use. After quality filtering, a total of 22,161,444 high-quality SE reads (3.8 Gb, 95% of the raw data) remained. Quality of the clean reads data was assessed using FASTX Toolkit (www.hannonlab.cshl.edu/fastxtoolkit).
De novo assembly of C. borivilianum transcriptome was optimized after assessing effect of various k-mer lengths. The high quality trimmed reads (48 bp) were assembled using SOAPdenovo program at k-mer length of 17, 21, 23, 25, 27, 31, and 35 (Table 1). Total number of transcripts decreased linearly with the increment of k-mer size suggesting over-representation at lower kmer and under-representation at higher k-mers. It was observed that sequences assembled at higher k-mer were enriched for transcripts with higher coverage/higher expression. For assembly process, only those reads were considered that produced high frequency k-mer. Several output parameters were analyzed that included total number of contigs, contigs with length 100 bp and above, N50 length, longest contig length, and average contig length as a function of k-mer. For the above mentioned data set, kmer size of 23 emerged as the best for assembly with N50 length of 245 bp, largest contig length 3,168 bp and average contig length of 221 bp (Table 1). A total of 101,589 contigs having length of at least 100 bp were generated. These contigs made the final representatives for assembled sequences for this study. Total bases covered by contigs with length greater than 100 bp and above came out to be 22.47 Mb (Table 2). Although the majority of the contigs lie between 1 bp to 100 bp, a total of 6568 contigs with a size of 500 bp and above were are generated (Fig 1a).

Sequence clustering and similarity search
To reduce any redundancy, assembled sequences from the above analysis were clustered by hierarchical clustering with TGICL [27], which generated 1273 clusters. These clusters were further assembled using CAP3 program [28] and using 896 sequences, 448 consensus sequences were generated. This resulted in reduction of uniquely assembled contigs from 101,589 to 101,141. Contigs shorter than 100 bp were removed and total coverage with contig length $100 bp came out to be 22.42 Mb. Although majority of the contigs lie between 150 bp to 500 bp, we obtained a total of 6,290 contigs with a size of .500 bp (Fig 1b).
For contig annotation, similarity search was performed using BLASTX against the non-redundant (NR) protein database with an E-value cut off 1.0e 205 . For read assembled transcript sequences, significant BLASTX hits were found for a total of 50,826 sequences (50.4% of the total unigenes obtained; Table  S1). The E-value distribution of the top hits in the NR database showed that 15.9% of the mapped sequences have high similarity (,1.0e 250 ), while the other 84.1% of sequences, similarity ranged between 1.0e 205 to 1.0e 250 (Fig 2a). The similarity distribution revealed 67.5% of the query sequences have a similarity higher than 80%, while 32.4% of the hits have a similarity ranging from 20% to 80% (Fig 2b). Homologous genes come from several species, with 23.5% of the unigenes having the highest homology to genes from Vitis vinifera followed by Oryza sativa (16.3%), Populus trichocarpa (10.8%), Ricinus communis (9.0%), Glycine max (8.1%) and Sorghum bicolor (7.2%) (Fig 2c).
Due to lack of C. borivilianum genome information, nearly half of the unigenes did not show a match in the plant protein dataset of NR database. Moreover, 63.4% of the above mentioned 50,826 annotation descriptions were uninformative (e.g., 'unknown', 'unnamed', 'putative', or 'hypothetical' protein).

Validation of assembled sequences against the available ESTs of C. borivilianum
The assembled sequences were validated by sequence alignments against ESTs of C. borivilianum submitted at NCBI dbEST by our group [13]. BLASTN analysis of assembled transcripts was performed with 459 ESTs with an E-value threshold of 1.0e 205 (Table S2). Alignment parameters were manually checked and also analyzed for mis-assemblies. Significant hits were observed for 394 sequences (85.8%), while no hit could be obtained for 65 ESTs from the assembled transcript set. It was observed that most of the assembled transcript sequences aligned correctly and in continuous form, with average identity of 95.67%, showing good assembly quality. The possible reason for unmapped unigenes of C. borivilianum could be due the presence of relatively short and low quality singletons, UTR sequences far from the translation start or stop sites (.1,000 bp), fusion transcripts and those having incomplete coverage by the genome. Similar results were shown in A. thaliana where around 13% of the ESTs could not be aligned to the predicted genes [29] and in human only 64% of the reads could be mapped to the RefSeq database of well annotated human genes [30]. Picorrhiza kurrooa had 14% of the ESTs that could not be aligned to its predicted genes [31].

Functional annotation and classification of C. borivilianum transcriptome
Chlorophytum borivilianum transcripts were assigned GO terms to describe functions of genes and associated gene products into three major categories namely, biological process, molecular function, and cellular component, and their sub-categories [24]. A total of 30,625 out of 50,826 assembled sequences yielded significant annotation against GO database. These genes were further classified into three major categories namely, biological process, molecular function, and cellular component using plant specific GO slims that broadly provides an overview of the ontology content. Functional classification of C. borivilianum transcripts in biological process category (Fig 3) showed that metabolic process (GO: 0008152), response to stimulus (GO: 0050896), nucleobase containing compound metabolic process (GO: 0006139) and extracellular structural organization (GO: 0009987) were among the highly represented groups indicating that the plant is undergoing rapid growth and extensive metabolic activity. In cellular component group, sequences related to the intracellular regions (GO: 0005622) and membrane regions were wellrepresented categories (GO: 0016020) (Fig 3). Transcripts belonging to major subgroups of molecular function category included binding functions (GO: 0005488), transferase activity (GO: 0016740), catalytic activity (GO: 0003824) and oxidoreductase activity (GO: 001649) (Fig 3). These GO annotations provide a comprehensive information on C. borivilianum expressed genes that are encoding diverse proteins.
Best EC classification was obtained for 14,998 assembled sequences, which were annotated with 15,866 enzyme codes. Interestingly, a large amount of assembled transcripts belonged to non-specific protein tyrosine kinases enzyme class alone (14.7%). In arabidopsis, this enzyme controls shoot and floral meristem size [32] and also contributes to signal transduction [33]. Presence of this enzyme in such abundance may suggest that the plant is in active metabolic phase.
We also compared KEGG pathways mapping of C. borivilianum with KEGG mapping of O. sativa and A. thaliana. Sample score corresponds to the transcripts of C. borivilianum mapped on KEGG pathways and background score is the total genes in the genome of the background species mapped on KEGG pathways. We found 50,034 transcripts showing similarity with GO database at e-value cut-off of 1e 205 . These transcripts were used for annotation of KEGG pathways. These transcripts were annotated with 257 KEGG pathways which were present in O. sativa and A. thaliana. A total of 236 pathways were common in both of background species. 10 unique pathways of O. sativa and 11 of A. thaliana were also mapped. Total 3,805 transcripts were mapped on these KEGG pathways common with O. sativa and 3,806 transcripts were mapped on KEGG pathways common with A. thaliana. Out of these 43 KEGG pathways were significantly enriched (p-Value ,0.05) with the genes present in O. sativa. Non-homologous end joining, Lipopolysaccharise biosynthesis, Spliceosome, RIG-I like receptor signaling pathway, mineral absorption, GPI-anchor biosynthesis, RNA transport, Basal transcription factors, N-Glycan biosynthesis and RNA degradation are top enriched pathways. In comparison to A. thaliana, 42 pathways were identified as enriched (p-value ,0.05). Spliceosome, RNA degradation, RNA transport, Pyrimidine metabolism and GPI anchor biosynthesis were found to be most enriched pathways in comparison with A. thaliana. A total of 100 transcripts in C. borivilianum were mapped on spliceosome KEGG pathway compared to 106 genes of O. sativa and 117 genes of A. thaliana. Highest number transcripts (106) were mapped to Ribosome pathways in C. borivilianum compared to 257 genes of O. sativa and 311 genes of A. thaliana. Other pathways with highest transcripts mapping include RNA transport (89), Purine metabolism (79) and protein processing in endoplasm (73), thus suggesting that the plant was in actively growing stage at the time of harvesting. Besides all these pathways, flavonoid biosynthesis and steroid biosynthesis were the most enriched pathways in C. borivilianum. The enrichment of these pathways clearly suggested that the plant possess antioxidant properties, anti-arthritic properties, and anticancer properties. Pathways comparison (Fisher's Exact P-value ,0.05) with respect to the background species O. sativa and A. thaliana is shown in Fig 5a and 5b. All the data of pathways mapping is provided in Table S3A and Table  S3B for O. sativa and A. thaliana respectively.

Discovery of the transcripts encoding putative transcription factors
Transcription factors interact with the promoter regions of a gene and modulate its expression. Associative modulation of several plant processes suggests involvement of transcription factors (TFs) for coordinated regulation of gene expression. By sequence comparison with known transcription factor gene families, 8,369 putative C. borivilianum transcription factor genes, distributed in at least 62 families, were identified representing 8.3% of C. borivilianum transcripts (Fig 6). These genes covered transcription factor gene families like MYB, MADS, Orphans, basic Helix-Loop-Helix (bHLH), bZIP, WRKY and many more. Among all these TF gene families, C3H, PHD and MYB were most abundant families (Table S4). MYB family proteins are characterized by DNA-binding domains and it is the largest transcription factor family in Arabidopsis thaliana as well, where it comprises of 163 genes [34]. These TFs have been associated with varied processes. It has been observed that many protein members of the MYB, bZIP and WRKY transcription factor families insinuates the regulation of stress responses [35]. Members of C3H family are involved in embryogenesis [36] and members of PHD family TFs are involved in vernalization processes [37]. bHLH members are involved in controlling cell proliferation and development of specific cell lineages [38] and bZIP regulates processes including pathogen defense light and stress signaling seed maturation and flower development. A complete picture about evolution of various transcription factor families will emerge once the complete genome sequence of C. borivilianum will be available.

GC content and SSRs analysis from the transcriptome of C. borivilianum
Deep sequencing of C. borivilianum transcriptome rendered us with an opportunity to calculate the ratio of GC contents. Knowledge of GC content helps in understanding ecology and evolution of particular taxa. It also plays a vital role in gene and genome regulation and also helps in determining the physical properties of the genome. It is an important indicator of stability of DNA [39]. Average GC content of C. borivilianum transcripts was 44% ( Figure S1). This is in range with the GC levels of coding sequences of O. sativa (43.6%) which also belong to monocot family.
Assembly of C. borivilianum was further assessed for the molecular markers. Morphological as well as biochemical markers are used in the authentication of herbal drugs. SSRs or microsatellite markers are polymorphic stretches of 1 to 6 nucleotide units repeated in tandem and randomly spread in eukaryotic genomes. SSRs are generally associated with functional and phenotypic variations. Moreover, SSRs allow identification of many alleles at a single locus, are easy to develop, distributed evenly all over the genome and are co-dominant [40]. SSRs are particularly useful when genome information of the crop is lacking. Since C. borivilianum is a cross pollinated species [41], and hence seed raised population will have variability.
For identification of SSRs, transcripts of C. borivilianum were analyzed with perl script MISA. 3,487 SSRs were identified in 3,321 (3.28%) transcripts comprehensively out of which, 160 sequences contained more than 1 SSR. With a frequency of over 43.1% (1,503/3,321), di-nucleotides were most abundant of all the SSRs obtained. Tri-nucleotides were more than 32.89% (1,147/  3,487), followed by tetra-nucleotides (0.4%, 15/3,487) and pentanucleotides (0.05%, 2/3487). Most prevalent of mononucleotide was poly-A. The edge of poly-A mono-nucleotide SSRs over others may be due to the presence of poly-A tails in the RNA sequences (Table S5). Di-nucleotide SSRs were the most abundant in the transcripts, which is similar to results obtained from other plants [42]. Among the di-nucleotide repeat classes, AG/GA/CT/ TC (90.7%) was the most frequent dimer motif. Other dimer motifs included TG/CA and AT/TA (Table S5). CG repeats were infrequent in the plant (0.6%), which is consistent with previous observations [42,43,]. Among tri-nucleotide repeats, CTC/CCT/ GAG/GAA was the largest repeat class followed by CTT/AAG/ GGA/AGA (Table 3). These findings indicated that unique sequences containing SSR markers were indeed abundant in C. borivilianum. Di-and tri-nucleotide SSR motifs and the number of repeats are presented in Table 3. In particular, several SSR motifs were linked with the unique sequences encoding enzymes (e.g. MVK, DXS, SqS) involved in saponin biosynthesis (Table 4). These unique sequence-derived markers generated in this study represent a valuable genetic resource for future studies in this and related species.

Analysis of metabolic pathway genes
Saponins present in C. borivilianum are the primary source of its significant medicinal properties and are synthesized by mevalonate and non-mevalonate pathways in plants. TLC analysis indicated that significant levels of saponins were present in C. borivilianum tissues ( Figure S2). Nearly 10 spots were visualized on TLC plate upon derivatization with anisaldehyde -sulfuric acid reagent. Previous studies reported that initial reactions of isoprenoid biosynthetic pathway occur in leaves, while later step modifications and storage of saponins occurs in roots [13,14], thus the amount of saponins is high in roots. There are also few reports on the presence of flavonoid compounds in C. borivilianum [44]. These reports very well validate the presence of saponins and flavonoids in C. borivilianum, thus, our interest was to identify the genes responsible for biosynthesis of saponins and other metabolites in our dataset. The genes related to following pathways/compounds were identified: 3.7.1. Saponin biosynthesis pathway. Saponins are derived from geranyl pyrophosphate (GPP). GPP is synthesized by sequential head to tail addition of isopentenyl pyrophosphate (IPP) and its allelic isomer dimethylallyl pyrophosphate (DMAPP) [45]. IPP and DMAPP are synthesized via the cross talks between the cytosolic mevalonate (MVA) pathway and plastidial 2-C-methyl-D-erythritol-4-phosphate (MEP) pathway [46]. MVA pathway starts from the condensation of acetyl-CoA [47], whereas MEP pathway needs pyruvate and _lyceraldehydes 3-phosphate [48]. In C. borivilianum transcriptome dataset, multiple transcripts encoding almost all known enzymes involved in the MVA pathway, MEP pathway and saponin biosynthesis pathway were identified (Fig 7a). In almost all the cases, more than one unique sequence was annotated as same enzyme. These unique sequences may either represent different fragments of a single transcript or different members of a gene or both. MVA pathway starts with the condensation of three molecules of acetyl-CoA while in methylerythritol phosphate (non-mevalonate) pathway, isopentenyl diphosphate (IPP) is formed from pyruvate and _lyceraldehydes-3phosphate. Both biosynthetic routes yield the activated isoprene units, dimethylallyl diphosphate (DMAPP) and IPP [49]. Genes involved in MVA pathway that were found in our transcriptome included acetyl CoA-acetyltransferase (EC 2.3.1.9, 5 unigenes), HMG-CoA reductase (EC 1.1.1.34, 13 1.33, 2 unigenes). MEP pathway genes are also converted to IPP by reaction catalyzed by series of enzymes. These included 1-deoxy-D-xylulose-5-phosphate synthase (EC 2.2.1.7, 18 unigenes), 1-deoxy-D-xylulose-5-phosphate reductoisomerase (EC 1.1.1.267, 3 unigenes), 2-C-methyl-D-erythritol 4-phosphate  1.21, 9 unigenes). Subsequently, squalene monooxygenase (EC 1.14.13.132, 14 unigenes) catalyzes the conversion of squalene to 2, 3-oxidosqualene. The cyclization event of 2, 3-oxidosqualene is catalyzed by a class of enzymes, OSCs. Cyclization of 2, 3-oxidosqualene is rate limiting step and this event is also the branch point for sterol and triterpenoid biosynthesis in many plants [50]. Three OSC genes of C. borivilianum, cycloartenol synthase (E.C. 5.4.99.8, 16 unigenes) [51], b-amyrin synthase (E.C. 5.4.99.39, 1 unigene) [52] and Dammarenediol II synthase (E.C. 4.2.1.125, 1 unigene) [53] exist in our dataset. Steroidal saponins, borivilianosides, are the major type of saponins present in C. borivilianum [6]. As yet, no reports reveal the presence of triterpenoid type of saponins in C. borivilianum. However, two singleton sequences (Transcript ID: 664097 and 798016) matched with b-amyrin synthase and Dammarenediol II synthase of V. vinifera and R. communis respectively.
Amongst reactions for the modification of secondary metabolites, glycosylation plays an important role in plants, contributing to biosynthesis and storage of secondary metabolites [67]. UDP-glycosyltransferases (UGTs) catalyze the transfer of sugar residues from uridine diphosphate sugars to an acceptor. Several unigenes encoding for different type of glycosyltransferases were found in our dataset (Table S7). Out of these, UGTs supposed to be involved in saponin biosynthesis included soyasaponin III rhamnosyltransferase (EC 2.4.1.273, 4 unigenes) and sterol 3-beta-glucosyltransferase (EC 2.4.1.173, 9 unigenes). Soyasaponin III rhamnosyltransferase is involved in the biosynthesis of soyasaponin I in Glycine max [68]. The enzyme has strong sugar donor specificity for UDPrhamnose. It does not show any activity with UDP-glucose, UDP-galactose or UDP-glucuronic acid. Sterol 3-beta-glucosyltransferase on the other hand is involved in the glycosylation activity of different type of sterols (e.g. sitosterol, stigmasterol etc.) present in the plant. This enzyme is thought to be involved in steroid metabolism.
UGTs involved in the glycosylation of flavonoids were also present in C. borivilianum transcriptome dataset, some of which were   Presence of such glycosylation enzymes and cytochrome P450 genes might contribute to extensive modifications of various secondary metabolites in C. borivilianum. 3.7.4. Alkaloid biosynthesis pathway. Alkaloids are heterocyclic nitrogen compounds synthesized from amino acids and are the largest groups of natural plant products [69]. Potent biological activity of some alkaloids has led to their exploitation as pharmaceuticals, stimulants, narcotics and poisons. Plant-derived alkaloids currently in clinical use include, but not limited to, analgesics morphine and codeine, anticancer agents vinblastine and taxol, gout suppressant colchicine, muscle relaxant (C)tubocurarine, antiarrythmic ajmaline, antibiotic sanguinarine, and sedative scopolamine. Other important alkaloids of plant origin include caffeine, nicotine, cocaine, and synthetic O, O-acetylated morphine derivative heroin.
In C. borivilianum transcriptome dataset, in addition to saponins and flavonoids, we also discovered genes involved in the biosynthesis of benzylisoquinoline alkaloids (BIA). Amino acid tyrosine is the precursor of benzylisoquinoline alkaloids. BIA biosynthesis begins with a metabolic lattice of decarboxylations, ortho hydroxylations and deaminations that convert tyrosine to both dopamine and 4-hydroxyphenylacetaldehyde [70]. This reaction is catalyzed by tyrosine decarboxylase (TYDC; EC 4.1. 1.25, 8 unigenes), the only enzyme involved in these early steps that has been purified [71], and for which the corresponding cDNA has been cloned [72,73] (Fig 7c). TYDC cDNAs have also been reported from parsley [74] and Arabidopsis thaliana [75], which do not accumulate tyrosine-derived alkaloids. TYDC mRNAs were shown to be rapidly induced in response to elicitor treatment [74,75] and pathogen challenge [76] in various plants. Induction of TYDC mRNAs in parsley and Arabidopsis suggests that, in addition to BIAs, tyramine serves as precursor to a ubiquitous class  [80]. It contains cytotoxic steroidal glycoside saponinchloromaloside-A and spirostanolpentaglycosides embracing beta-Dapiofuranose which are used for their anticancer properties [81].
In the present study, for the first time, we reported the presence of genes involved in taxol biosynthesis. 39-N-debenzoyal-29deoxytaxol Nbenzoyaltransferase, taxane 10 -beta -hydroxylase (5-a-taxadienol-10-bhydroxylase) and deacetyl baccatin genes were discovered that are involved in taxol biosynthetic pathway. Few genes namely 6 neomenthol dehydrogenase and pulegone reductase, involved in menthol biosynthesis pathway, were also discovered. Menthol biosynthetic genes are primarily involved in plant defense, although there are evidences that menthol has a potent anticancer property, effecting cell death through TRPM8 receptor [82]. Further studies on this pathway could be beneficial as the combination of C. borivilianum with other herbs could be beneficial for cancer treatment.

Quantification of C. borivilianum transcripts
All the reads were mapped onto the non-redundant set of C. borivilianum transcripts which revealed that the number of reads corresponding to each transcript ranged from 2 to 13289 with an average of 94 reads per transcript indicating a very wide range of expression in transcripts of C. borivilianum (Table S8). Minimum coverage (RPKM) of a C. borivilianum transcript was 3.38 and maximum of 4328.27 with an average of 55.78 ( Figure S3). Low expressing transcripts were also present in our assembly. We also studied expression of various genes that were involved in the saponin biosynthesis and flavonoid biosynthesis pathways. We found that expression of unigene 807650 annotated as Acetyl Coacetyl transferase (AACT), unigene662091 annotated as HMG-CoA synthase, unigene 757982 annotated as HMG-CoA reductase and unigene724460 annotated as Squalene monooxygenase was highest among saponin biosynthesis pathway genes (Table 4). AACT participates in 10 different metabolic pathways and HMG-CoA, synthesized by HMG-CoA synthase is an intermediate both in the saponin biosynthesis and in ketogenesis. This may be the reason of its high abundance in C. borivilianum transcriptome dataset. The expression of cycloartenol synthase (EC 5.4.99.8) (100 reads) was 5 fold higher than that of b-amyrin synthase (21 reads) (EC 5.4.99.39) ( Table 5) thus, further confirming the diversion of the pathway towards the synthesis of steroidal type of saponins in this plant.
Most of the genes of mevalonate pathway had expression greater than the genes of non-mevalonate pathway (Table 5). It might be due to the fact that mevalonate pathway is more active in the production of saponins in C. borivilainum as compared to nonmevalonate pathway as is the case commonly seen in monocot plants. In case of flavonoid biosynthesis, expression of 4-coumarate CoA ligase, dihydroflavonol 4-reductase, flavonol synthase, flavanone 3hydroxylase was the highest (Table 6).

Conclusion
In conclusion, identification of genes involved in saponin and flavonoid biosynthesis will contribute to future functional studies in the plant and provide a basis for improving production levels in plants or in microbial hosts by metabolic engineering. In our data set, we also identified transcripts that encode for enzymes involved in anticancer and plant defense properties. Further, we annotated a large number of genes involved in the various pathways. Dataset of assembled C. borivilianum unigenes presented here will provide the foundation for other functional and comparative genomic studies.