De novo Transcriptome Assembly and the Putative Biosynthetic Pathway of Steroidal Sapogenins of Dioscorea composita

The plant Dioscorea composita has important applications in the medical and energy industries, and can be used for the extraction of steroidal sapogenins (important raw materials for the synthesis of steroidal drugs) and bioethanol production. However, little is known at the genetic level about how sapogenins are biosynthesized in this plant. Using Illumina deep sequencing, 62,341 unigenes were obtained by assembling its transcriptome, and 27,720 unigenes were annotated. Of these, 8,022 unigenes were mapped to 243 specific pathways, and 531 unigenes were identified to be involved in 24 secondary metabolic pathways. 35 enzymes, which were encoded by 79 unigenes, were related to the biosynthesis of steroidal sapogenins in this transcriptome database, covering almost all the nodes in the steroidal pathway. The results of real-time PCR experiments on ten related transcripts (HMGR, MK, SQLE, FPPS, DXS, CAS, HMED, CYP51, DHCR7, and DHCR24) indicated that sapogenins were mainly biosynthesized by the mevalonate pathway. The expression of these ten transcripts in the tuber and leaves was found to be much higher than in the stem. Also, expression in the shoots was low. The nucleotide and protein sequences and conserved domains of four related genes (HMGR, CAS, SQS, and SMT1) were highly conserved between D. composita and D. zingiberensis; but expression of these four genes is greater in D. composita. However, there is no expression of these key enzymes in potato and no steroidal sapogenins are synthesized.


Background
Steroidal sapogenins exist naturally as aglycones of steroidal saponins (a large family with enormous structural diversity) which have attracted a great deal of interest lately due to their wide spectrum of biological and pharmacological functions [1][2]. Steroidal saponins are among the more widely distributed secondary metabolites within the plant kingdom [3]. de novo assembled by using Trinity software (v2012- [10][11][12][13][14][15] [17]. Finally, 131,655 transcripts with length above 200 bp were obtained. The average length of these transcripts was 1,368 bp and the N50 length was 2,303 bp. There were 62,285 transcripts with length 1,000 bp, 31,868 transcripts with length 2,000 bp, and the max length was 25,464 bp. The longest transcript of each gene was selected as the unigene, and we generated 62,341 unigenes above 200 bp. The average length of these unigenes was 724 bp, and the N50 length was 1,228 bp. There were 14,425 unigenes 1,000 bp, 5,975 unigenes 2,000 bp, and the max length was 22,744 bp (Table 1). Common perl scripts were used to conduct the statistical analysis.
To evaluate the final assembly, the ratios of the conserved domain sequences (CDSs) with different lengths containing transcripts to the total number of transcripts were calculated. For example, there were 17,346 transcripts containing long-CDSs with length 1,500 bp, so the ratio was 13.2% (compared with 131,655, the total number of transcripts-see Table 2). In the final results, 83,114 transcripts contained CDSs with length 300 bp (a ratio of 63.1%).
Additionally, the authors aligned the 396 nucleotide sequences of Dioscorea from the National Center for Biotechnology Information (NCBI) with the total transcripts under the  Average length of Unigenes (bp) 724 Max length of Unigenes (bp) 22,744 condition that the blast length was greater than 200 bp. The number of sequences hit was 370 with an average sensitivity of 81.5%. These assessments suggest that the assembly was satisfactory.

Functional annotation and classification of D. composita transcriptome
To investigate the function of these unigenes, sequence similarity was compared with several databases: the non-redundant protein database (NR) and the Arabidopsis protein database and the Rice protein database using Blastx software (e values < 10 -5 ) [18], the Kyoto Encyclopedia of Genes and Genomes (KEGG) database using KAAS software (e values < 10 -10 ) [19]. The protein family and domain predictions were completed by using the protein family (Pfam) database andhmmscan software package(e values < 10 -2 ) [20]. A total of 27,720 unigenes (44.5%) were annotated using the public protein databases (see Table 3 and S1-S5 Tables). KEGG classification of the annotated unigenes was executed. The results indicated that 8,022 (12.9%) of the unigenes could be mapped to the KEGG database. The analysis revealed that: 864 (10.8%) of these unigenes are clustered into a 'cellular processes' section wherein the majority of the terms relate to 'transport and catabolism'; 719 (9.0%) are grouped into an 'environmental information processing' category and mostly enriched in 'signal transduction'; 2,009 (25.0%) of the unigenes are clustered in a 'genetic information processing' section in which more than half belong to a 'translation' sub-section; 3,866 (48.2%) are clustered into a 'metabolism' category and the majority of these relate to 'carbohydrate metabolism'; 1,316 (16.4%) of the unigenes can be grouped into a 'organism systems' category, where the largest sub-section is 'environmental adaptation' (see Fig 2).

Secondary metabolic pathway analysis
KEGG maps provide much information to help understand the high-level functions of a biological system. Metabolic pathway analysis enables us to realize which genes in the same pathway cooperate with each other as they exercise their biological functions. In all, 8,022 unigenes could be mapped with 243 metabolic pathways based on the KEGG blast analysis, and 531 unigenes were mapped into 24 secondary metabolic biosynthetic pathways (Fig 3). These secondary metabolic genes supplied further information for analyzing the secondary metabolite biosyntheses. Many unigenes which encode key enzymes in the biosynthetic pathways for phenylpropanoids, flavonoids, steroids, stilbenoid, diarylheptanoid and gingerol in D. composita were confirmed.

Steroidal sapogenin biosynthetic pathways
Steroidal sapogenins in plants were biosynthesized from two common C 5 isoprene unites: isopentenyl diphosphate and dimethylallyl diphosphate. These unites were synthesized by the cytosolic mevalonate (MVA) and plastial 2-C-methl-D-erythritol-4-phospate (MEP) pathways [21]. The MVA pathway played a major role in the production of steroidal backbones [22]. A total of 35 enzymes encoded by 79 unigenes were related to steroidal sapogenin biosynthesis in this transcriptome database (Table 4 and S6 Table). These almost cover the entire pathway. The protein families in which these unigenes are involved are listed in Table 4, the similarity of these unigenes with Arabidopsis and rice is presented in Table 5 and the primary biosynthetic pathway is illustrated in Fig 4. As shown in Fig 4, the complete biosynthetic pathway was divided into three parts according to the KEGG classification of this transcriptome: a steroidal backbone part, a sesquiterpenoid and triterpenoid part, and a steroid biosynthesis part. In the first part, the steroidal sapogenin backbones were biosynthesized from acetyl-CoA (C2) to farnesyl diphosphate (C15) in 17 steps which involved 23 enzymes and 47 unigenes. Farnesyl diphosphate was formed by dimethylallyl diphosphate and two isopentenyl diphosphate molecules in a sequential head-totail condensation. This process was the same as in terpenoid sapogenin synthesis. Within the sesquiterpenoid and triterpenoid biosynthesis section, two farnesyl diphosphate molecules were condensed tail-to-tail style [21]. This was followed by oxidation to form squalene-2,3-epoxide (C30) in 3 steps-2enzymes and 2 unigenes participate in these steps. In contrast to terpenoid sapogenin biosynthesis, squalene-2,3-epoxide transformed to a chair-boat-chair-boat conformation but not a chair-chair-chair-boat one. In the steroidal biosynthesis part, squalene-2,3-epoxide with chair-boat-chair-boat structure was cycled to cycloartenol (C30) which was regarded as the precursor to the steroidal sapogenins [23][24][25]. Then, the steroidal sapogenins (C28) which are produced underwent various modifications through the action of many enzymes, e.g. cytochrome P450, oxidase, reductase, isomerase, transferase, etc [26]. However, the specific steps involved in steroidal sapogenin biosynthesis from cycloartenol were unclear in the plants of the Dioscoreaceae family. Expression profiles of 10 transcripts related to the biosynthesis of steroidal sapogenin D. composita tuber was used as raw material to extract sapogenin. It is interesting to discover whether or not related gene expression is correlated with developmental stages and tissues. The authors used relative quantitative RT-PCR to analyze the expression profiles of 10 transcripts related to steroidal sapogenin biosynthesis using material from the leaves, stems, and tubers of 18-month-old plants. The results are shown in Fig 5. Results show that all the genes selected, except HMED and CYP51, were expressed in the tuber, leaf, and stem of the plant. Five genes (HMGR, MK, SQLE, CYP51, and DHCR7) exhibited their largest expression levels in the tuber. The other five genes (FPPS, DXS, HMED, CAS, and DHCR24) had larger expression levels in the leaves. The expression levels in the tubers and leaves of all the genes selected were much higher than in the stems. This result reveals that the plants produced a variety of compounds by this pathway and their functions were different. Some of the products related to membrane components, e.g. phytosterols, so the related genes were expressed in all tissues [27]. Other compounds, like sapogenins, functioned to counteract pathogens and herbivores, and were mainly biosynthesized and stored in the tubers. The leaves were able to metabolize the triterpenic constituent of leaf wax in their defensive glands [28]. The expression of the genes involved in the MVA pathway (HMGR and MK) was much greater than the genes involved in the MEP pathway (DXS and HMED). This difference between the two pathways indicated that the biosynthesis of steroidal sapogenins chiefly occurred via the MVA pathway in D. composita.
Our study also compared the expression patterns in 1-month-old shoots as well (Fig 5). Compared to 18-month-old plants, 9 genes (all except DXS) were expressed in 1-month-old shoots at much lower levels. These results reveal that the secondary metabolism was not prevalent at the seedling stage which suggests the saponins might not be required for seedling development and survival [29]. The expression of DXS in shoots was higher than in all organs of the 18-month-old plants, which indicates that DXS played a role in producing gibberellins which improved the growth of leaves and budlets [15].

The differences between D. composita and other tuberous plants
To investigate why steroidal sapogenins are synthesized in D. composita more copiously than in other tuberous plants, the gene sequences, protein sequences, conserved domains, and expression quantities of steroids in several tuberous plants were analyzed.

Steroidal sapogenins content
The contents of steroidal sapogenins in D. zingiberensis, Chinese yam (Dioscorea opposita Thunb.), potato (Solanum tuberosum L.), and cassava (Manihot esculenta) were compared with that in D. composita (Fig 6). According to the measurements, the steroidal sapogenins content in D. composita was 1.56 times and 52.6 times that of D. zingiberensis and Chinese yam, respectively. There were no steroidal sapogenins determined in potato and cassava.

Nucleotide sequences, protein sequences, and conserved domains
Four steroidal sapogenin synthesis-related enzymes: HMGR, CAS, SQS and SMT1 (sterol 24-Cmethyltransferase) were chosen for comparison in D. composita, D. zingiberensis, and potato (there was no relevant data published for Chinese yam and cassava). The results are shown in Tables 6-8. HMGR was involved in a rate-limiting step in the isoprene biosynthesis via the MVA pathway. CAS was a key enzyme in steroidal sapogenin precursor synthesis. SQS produced the triterpene and tetraterpene precursors for many diverse sterol and carotenoid end products. SMT1 was 'in charge' of protein structure modification. All the blast analyses were completed online through the NCBI website, and the conserved domains of HMGR, CAS, SQS, and SMT1 in D. composita were calculated online using CDD v3.11-45746 PSSMs on the NCBI site.
For HMGR, CAS, SQS, and SMT1, the gene sequences (above 89%), protein sequences (above 90%), and conserved domains (above 79%) were similar in D. composita and D. zingiberensis. This suggests that the four enzymes were produced with quite consistent structure and function in closely related plants like D. composita and D. zingiberensis, so the related metabolic pathways and products might be extremely similar. In potato, the protein sequences and conserved domains of HMGR, CAS, and SQS were more dissimilar and SMT1 had little similarity with D. composita-the gene sequences had no similarity with D. composita. This means that HMGR, CAS, and SQS had some structural similarity in D. composita and potato, but the products in potato might involve carotenoids or other lipids rather than steroidal sapogenins [29]. The structure and function of SMT1 in potato were different from that in D. composita, especially in the Sterol_MT_C domain. SMT1 had another conserved domain which used adenosylmethionine as a substrate for methyl transfer. Thus, it might catalyze the methionine methyl transference instead of the sterol in potato.

Expression of four specific genes
The expression of 4 genes related to steroidal sapogenin biosynthesis in the tuber of D. composita, D. zingiberensis, and potato was analyzed by relative quantitative RT-PCR. The results are shown in Fig 7. The expression of HMGR, CAS, SQS, and SMT1 in D. composita was found to be 2.0, 159.1, 3.4, and 8.9 times that in D. zingiberensis, respectively. Therefore, steroidal sapogenin biosynthesis was more rapid and efficient in D. composita. The natural role of steroidal sapogenins is to confer protection against potential pathogens [9]. It is thought it might play a  part in defense-related signaling processes in the presence of pathogen infection [30]. D. zingiberensis has been cultivated as a medicine in favorable planting conditions for several decades,  so the capacity of its defense response might have degenerated without frequent pathogen accumulation. Conversely, D. composita has been living in the wild and has better maintained its defense capability. The expressions of HMGR, SQS, and SMT1 in D. composita were 2.8, 7.3, and 27.9 times greater than in potato, respectively. CAS was hardly expressed in potato at all. There were no steroidal sapogenins synthesized in potato because it lacked a key enzyme in its metabolic pathway. In fact, all the data related to CAS in potato in the public domain had been predicted though genome sequencing [31]. CAS might exist as a non-functional gene in potato. Evolution, maybe, was one of the important reasons for this difference (different genera of plants producing different antimicrobial compounds to resist different environments in the course of evolution).

Conclusions
Steroidal sapogenins are generally considered to be part of the defense system in plants. They have recently been attracting some interests due to their wide spectrum of biological, pharmacological, economical, and medicinal importance. Some species of Dioscoreaceae, like D. zingiberensis, are utilized mainly as a material to extract steroidal sapogenins for the medicinal industry [5]. As a novel resource, the quantity and quality of steroidal sapogenins in D. composita are exceptional. Determination of the genes involved in this secondary metabolism should allow researchers to improve the production of its effective constituents by the various techniques of molecular biology and metabolic regulation. Despite the recent great improvements in sequencing technology, no genomic data were available for consultation for any of the Dioscoreaceae species.
In this work, we generated 27,720 unigenes through transcriptome sequencing, and predicted a pathway for steroidal sapogenin backbone biosynthesis in D. composita. Of these unigenes, 531 were obtained that relate to secondary metabolism, of which, 79 unigenes were associated with steroidal sapogenin biosynthesis. There were 21 steps in the pathway from acetyl-CoA to cycloartenol, the precursor of steroidal sapogenin biosynthesis [32]. A variety of steroidal sapogenins were produced through complex modification of cycloartenol, for example, the catalysis of stereospecific P450 cytochromes, but this process was unclear [33]. The expression profiles of 10 related transcripts revealed that all the organs involved in this pathway, gene expression was the largest in the tubers and leaves. This was because the products of this pathway included membrane and defense constituents. The low expression observed in shoots suggested that the secondary metabolism was not prevalent at the seedling stage.
Through a comparison with D. zingiberensis, the expression of 4 related genes were found to be the greater in D. composita, and result in greater enrichment of steroidal sapogenins in D. composita. This may be caused by the stronger defense response in D. composita. No expression of key enzymes led to a lack of steroidal sapogenins in potato. This preliminary information allows us to understand the biosynthetic pathway of steroidal sapogenins in D. composita, and should serve as a foundation for manipulating the secondary metabolism to produce healthpromoting ingredients. In addition, the transcriptome data we have derived constitutes a much more abundant genetic resource that can be utilized to benefit further in-depth studies on D. composita.

Plant material
No specific permissions were required for the field studies in Shaoguan, Guangdong Province, China. It was, however, confirmed that these studies did not involve endangered or protected species.
The GPS coordinates of the location of the field studies is 24.305984559949003, 113.96962703698728. D. composita samples were collected from 18-month-old plants and 1-month-old shoots (counted from field planting) that were cultivated in Shaoguan. For the 18-month-old plants, leaves, stems, and tubers were separately harvested from six plant individuals. Samples were cut and mixed using samples from the top, middle, and bottom parts of the same tissues from an individual. For the 1-month-old shoots, six whole plants were collected. All samples were immediately frozen in liquid nitrogen after cleaning with germ-free water, and stored at-80°C before use.
D. zingiberensis tuber samples were collected from 18-month-old plants, Chinese yam tuber samples were collected from 9-month-old plants, potato tuber samples were collected from 6-month-old plants, and cassava tuber samples were collected from 12-month-old plants. All of these were cultivated in Shaoguan. For each kind of plant, samples were separately harvested from six individuals, cut, and the samples from the top, middle, and bottom parts of the tuber from each individual mixed.

RNA isolation and sequencing
Total RNA was isolated by using a TIANDZ Column Plant RNAout 2.0 (with DNase) from various organs and development stages, respectively. The quality and concentration of RNA were detected by electrophoresis and a Backman Counter (DU730). Parts of the RNA from different organs (18-month-old plants) were mixed together in equal measure and sent to NOVOGENE (Beijing, China) where the cDNA library was established and Illumina sequencing was completed.
The remaining RNA from different tissues (18-month-old plants), RNA extracted from 1-month-old shoots, and RNA from different plants were used for real-time PCR analysis. About 1 μg of total RNA was reverse transcribed into single-stranded cDNA using a TaKaRa PrimeScript RT reagent kit with gDNA Eraser (Perfect Real Time) for each sample. The cDNA solutions were diluted 10 times for templates in real-time PCR.

Steroidal sapogenin determination
Freeze-dried powdered tuber (1 g) was hydrolyzed with hydrochloric acid (2 M) by boiling in a water bath for 4 h. This was filtered and the residue washed to neutral. It was then dried at 80°C. The residue was put into a soxhlet extractor with filter paper package, refluxed with petroleum ether (80°C) for 4 h. The petroleum ether was eevaporated and the residue dissolved in chloroform, diluting it to 10 mL.

Real-time PCR and expression analysis
The primers of 12 selected transcripts were designed online using Primer-NCBI. The quantitative reactions were finished with TaKaRa SYBR Premix ExTaq II (TliRNaseH Plus) on Eppendorf Mastercycler Real-Time Thermal Cylers. PCR conditions: 95°C for 2 min, followed by 40 cycles of 95°C for 15 s, 60°C for 15 s, and 72°C for 20 s. The expressions of selected genes were normalized against glyceraldehyde-3-phosphate dehydrogenase (GAPDH) which was an internal reference gene. Moreover, the relative expression was counted using the 2 -ΔΔCt method. All real-time PCR experiments were repeated using three biological, and three technical, replications.