Transcriptome Sequencing of Codonopsis pilosula and Identification of Candidate Genes Involved in Polysaccharide Biosynthesis

Background Codonopsis pilosula (Franch.) Nannf. is one of the most widely used medicinal plants. Although chemical and pharmacological studies have shown that codonopsis polysaccharides (CPPs) are bioactive compounds and that their composition is variable, their biosynthetic pathways remain largely unknown. Next-generation sequencing is an efficient and high-throughput technique that allows the identification of candidate genes involved in secondary metabolism. Principal Findings To identify the components involved in CPP biosynthesis, a transcriptome library, prepared using root and other tissues, was assembled with the help of Illumina sequencing. A total of 9.2 Gb of clean nucleotides was obtained comprising 91,175,044 clean reads, 102,125 contigs, and 45,511 unigenes. After aligning the sequences to the public protein databases, 76.1% of the unigenes were annotated. Among these annotated unigenes, 26,189 were assigned to Gene Ontology categories, 11,415 to Clusters of Orthologous Groups, and 18,848 to Kyoto Encyclopedia of Genes and Genomes pathways. Analysis of abundance of transcripts in the library showed that genes, including those encoding metallothionein, aquaporin, and cysteine protease that are related to stress responses, were in the top list. Among genes involved in the biosynthesis of CPP, those responsible for the synthesis of UDP-L-arabinose and UDP-xylose were highly expressed. Significance To our knowledge, this is the first study to provide a public transcriptome dataset prepared from C. pilosula and an outline of the biosynthetic pathway of polysaccharides in a medicinal plant. Identified candidate genes involved in CPP biosynthesis provide understanding of the biosynthesis and regulation of CPP at the molecular level.


Ethics statement
The study was carried out on a private land and Guo Feng Zhao should be contacted for future permissions. The location is not protected in any way, and the field studies did not involve endangered or protected species.

Plant material
Two-year-old C. pilosula planted in a field, where good agricultural practices were implemented, was harvested and used as plant material in 2012 and 2013. The field was located in Lingchuan, Shanxi, China (1,522 m altitude, 35°47 0 N, 113°24 0 E).
Previous studies have shown that the level of CPPs in root tissue of C. pilosula increases steadily during the growing season [25,26]. At flowering and boll-forming stage (Fig. 1B), the content of CPPs was the highest in roots, and the content in flower buds was relatively higher than in other aerial parts (S1 Fig.). Thus, to comprehensively generate the C. pilosula transcriptome, flower buds, stems, leaves, as well as roots at this stage ( Fig. 1B) were collected in the midafternoon of July 18, 2012. The other root materials were sampled in the midafternoon of June 18, August 8, and September 5 in 2012. At these sampling points, the plants were at the following growth stage of the year: tendril-pulling stage, flowering stage, seed maturity stage. Morphological characterization of the plants at each of these stages was presented in Fig. 1. After collection, the specimens were first washed with tap water, dried using filter paper, and then quickly cut into small pieces on ice before being snap-frozen in liquid nitrogen. The samples were stored at -70°C for library preparation and RNA-sequencing (RNA-seq).
For real time PCR analysis, leaves, stems, flower buds, and roots were collected in the midafternoon of July 27, 2013, at the flowering and boll-forming stage of plants. They were sampled, washed, and stored using the same method as that described above.

Library preparation and sequencing (mRNA-seq)
Total RNA was isolated from each sample using a TRIzon kit (CWBIO, Beijing, China) following the manufacturer's instructions. DNA contamination was removed using a Qiagen RNase free DNase I (Qiagen, Hilden, Germany). The integrity of isolated RNA was verified by agarose gel electrophoresis. RNA was quantified using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific Inc., Wilmington, USA). Equal quantities of high-quality RNA from each sample were pooled and used for cDNA synthesis.
The cDNA library was constructed using a mRNA-Seq Sample Preparation Kit (Illumina Inc., San Diego, USA) according to the manufacturer's recommendations. Briefly, mRNA was purified from the total RNA by using mRNA-Seq Sample Preparation Kit. The purified mRNA was then fragmented into small pieces by using fragmentation buffer and the first strand cDNA was synthesized from fragmented mRNA by using random hexamers as primers. After the synthesis of the second strand cDNA using DNA polymerase I (Invitrogen, Carlsbad, CA, USA) and RNase H (New England Biolabs Inc., Ipswich, MA, USA), a 3 0 -adenine was added to the terminus for further ligation of Solexa adapters. Ligation products were separated on a 2% TAE agarose gel. A specific region (200 ± 25 bp) of the gel was excised and the cDNA contained in the excised gel was purified using a gel extraction kit (Qiagen). These fragments were used for 12 rounds of PCR amplification with primers complementary to the previously ligated adaptors. The final mRNA-seq libraries were constructed after the products were purified using a QIAquick PCR kit (Qiagen). The mRNA-seq libraries were sequenced on an Illumina Genome HiSeq 2000 platform (BioMarker Technologies Co., Ltd, Beijing, China).

De novo assembly and functional annotation
Clean reads were obtained by removing the adapter and low quality sequences as well as sequences that were less than 60 bases in length. De novo assembly was carried out using Trinity [27]. The K-mer value was set at 25 while other parameters were set at default levels. Redundancy removal was further performed with sequence clustering software TGICL to acquire non-redundant unigenes [28]. The unigenes obtained after gene family clustering were composed of clusters (corresponding prefix was CL), which contained several unigenes and singletons (the prefix was Unigene). Similarity between unigenes within one cluster was higher than 70%.

Prediction of the protein-coding region
For predicting the protein-coding region, the unigenes were first aligned (E-value <10 -5 ) to protein databases in the priority order of NR, Swiss-Prot, KEGG, and COG. The unigenes aligned to a database of higher priority were no longer aligned to a lower priority database. Proteins in the blast results with the highest ranks were selected and the coding sequences (CDSs) as well as the amino acid sequences were determined using standard codons as references. Unigenes that did not align to any available database were scanned using ESTScan, which produced nucleotide and amino acid sequences of the predicted coding region [34].

Analysis of the genes involved in the biosynthesis of CPPs
The analysis of genes involved in the biosynthesis of CPPs depended on the KEGG pathway annotation dataset. Candidate genes involved in CPPs biosynthesis were identified in KEGG database based on the metabolic pathways of starch, glucose, amino sugar, and nucleotide sugar. Neighbor joining (NJ) phylogenetic analysis of UGPase (uridine diphosphate glucose pyrophosphorylase) was performed with MEGA5 [35]. All potential transcripts annotated as GTs (glycosyltransferases) [EC: 2.4.x.y] were extracted with the help of Perl script based on the Swiss-Prot annotation and were classified by BLASTx searches against local Arabidopsis thaliana GTs database using a threshold of 1.0 -5 , which encompasses all genes encoding GTs in A. thaliana (http://www.cazy.org/e1.html).

Real-time PCR
Nine genes potentially involved in the biosynthesis of CPPs were selected for further analysis. Specific primers were designed using Primer Premier 3.0 [36]. The details regarding the genes and primers are presented in Table 1. For expression analysis, total RNA was extracted from the leaf, stem, flower bud, and root tissues at the same growth stage in 2013 using the protocol described earlier. The RNA was treated with DNase I to remove contaminating DNA (TaKaRa, Dalian, China). Following this, total RNA was reverse transcribed into cDNA by using a Prime-Script RT Master Mix (TaKaRa). All primer pairs were examined using standard real-time PCR and Premix Ex Taq (TaKaRa), and the presence of a single amplification product of the expected size for each gene was verified by electrophoresis on a 1.5% agarose gel followed by ethidium bromide staining.
Real-time PCR was performed using a SYBR Premix ExTaq Kit (TaKaRa) on an Applied Biosystems 7300 Real-Time PCR system (Applied Biosystems, Foster City, CA, USA). The total volume of the reaction mixture was 20 μL, and it contained 2 μL cDNA, 10 μL SYBR Premix Ex Taq, 0.4 μL ROX Reference Dye II, and each primer at a concentration of 4 pmol. PCR conditions were as follows: 95°C for 30 s, followed by 40 cycles of 95°C for 5 s and 60°C for 31 s. The relative expression levels of genes were calculated by 2 -ΔΔCt method [37,38]. Biological samples from three independent experiments were analyzed and each reaction was run in triplicates. Raw data on the relative abundance of each transcript were expressed as mean ± standard deviation (SD).

Results
Illumina sequencing, de novo assembly, and assessment of assembly program At the flowering and boll-forming stage, reproduction and vegetative growth concur in the species, and flower buds become gradually visible and the petals become extended (Fig. 1B). At this stage, concentrations of CPPs in the roots and flower buds are rather higher than in the other aerial parts. To cover the C. pilosula transcriptome comprehensively, total RNA was isolated from the four tissues collected at this stage as well as from the root tissue collected at other growth stages. Equal amounts of total RNA from these samples were collected and pooled. The mRNA was purified, fragmented, and reverse-transcribed into cDNA. The cDNA was subjected to Illumina HiSeq 2000 paired-end sequencing. After the removal of adaptor sequences, ambiguous reads, and low quality reads, 91,175,044 clean reads 101 bp in length, were obtained. All HiSeq 2000 sequencing data were combined and assembled using Trinity, which produced 102,125 contigs with an N50 of 999 bp (i.e., 50% of the assembled bases were incorporated into contigs of 999 bp or longer). Then, TGICL was used to reduce the redundancy. Finally, 45,511 unigenes were generated, which included 10,563 clusters and 34,948 singletons ( Table 2). The size distribution of the unigenes is shown in Fig. 2. The size distribution was homogeneous and 25.3% of unigenes were longer than 1000 bp. The longest unigene recovered in this work was Unigene CL5433.Contig2_Codonopsis, the length of which was 9,150 bp and it encoded alpha, alpha-trehalose phosphate synthase (according to Swiss-Port annotation result). All reads were deposited in the National Center for Biotechnology Information (NCBI) and can be accessed in the Short Read Archive (SRA) Sequence Database under project accession number: SRP038022.

Functional annotation and analysis of the predominant transcripts
All C. pilosula unigenes generated by HiSeq 2000 sequencing were aligned to public protein databases (NR, COG, Swiss-Prot, and KEGG) and nucleotide database (Nt) by BLASTx and BLASTn, respectively (with cut-off E-values of 10 -5 ). In total, 34,616 (76.1%) unigenes were aligned to homologous sequences in public databases (Table 3). To understand the expression profile of the assembled sequences, we aligned the high-quality clean reads to the assembled C. pilosula transcripts with the help of Bowtie program, allowing one mismatch. Following the alignments, the count of raw reads for each unigene was normalized to fragments per kilobase of exon model per million mapped fragments (FPKM) [39]. We also mapped the reads to the assembled transcriptome; 92.51% of the reads could be mapped to the C. pilosula library. The data on the mapped efficiency are presented in S1 Table. NR annotation Approximately 74.2% (33,748) of the unigenes were annotated with reference to the NR database. Based on the NR annotations, 30.3% of the annotated sequences had the highest homology (E-value < 10 -100 ) and 15% had moderate homology (10 -100 < E-value < 10 -60 ) (Fig. 3A). With respect to similarity distribution, 24.9% of the hits had higher similarity than 80%, and 46.8% had moderate similarities of approximately 60-80% (Fig. 3B). With respect to species, 38.4% of the unigenes had top matches to the sequences from Vitis vinifera, 15.8% to the sequences from Solanum lycopersicum, and 9.6% to those from Prunus persica (Fig. 3C).

GO categories
Among the 33,748 most significant BLASTx hits against the NR database, 26,189 unigenes could be assigned to one or more terms, which were categorized into 55 functional groups as shown in Fig. 4. Among the unigenes that were categorized in "the biological process," 16,286 were related to the "cellular process" and represented the largest proportion, followed by "metabolic process" (15,732), and "single-organism process" (11,743). Within the cellular components, the assignments were mostly enriched in the "cell part" (19,471) and "cell" (19,472). The GO terms for molecular function were mainly grouped into "catalytic activity" (13,164) and "binding" (12,307).

Analysis of highly expressed transcripts
The value of FPKM represents the abundance of a certain unigene in a certain gene pool, and is often used to indicate the level of gene expression in various biological samples. The top 20 relatively highly abundant genes in the dataset are listed in Table 4. Transcripts encoding chlorophyll a-b binding protein 40, ribulose-1,5-bisphosphate carboxylase small subunit, and chloroplast photosystem II 10 kDa polypeptide were abundant in the transcriptome library, indicating that the photosynthetic process was active in the tissues investigated. Metallothionein is responsible for the detoxification of heavy metals, metal ion transport, maintenance of dynamic balance, ion metabolism, and reactive oxygen removal [40]. Two transcripts (Unige-ne8032_Codonopsis and Unigene12118_Codonopsis) encoding metallothionein were also found abundantly in the tissues. Additionally, unigenes encoding aquaporin, cysteine protease, chloroplastic thiamine thiazole synthase, and dormancy-associated protein were also abundant in the cDNA library. The remaining unigenes with higher FPKM values could not be annotated successfully, indicating their unidentified roles in C. pilosula.

Predicted coding sequences
Based on three public protein databases, 36,146 CDSs were obtained, which accounted for 79.42% of the total number of unigenes. Among these, 33,804 CDSs were predicted by BLASTx and 2,342 by ESTScan, indicating relatively higher identities of the predicted CDSs to those available in the public databases. The length distribution of CDSs is shown in Fig. 7. A total of 27,562 CDSs were between 200 bp and 1000 bp in length, 68.25% of which was longer than 500 bp, thus constituting a relatively higher proportion of the predicted CDSs. The number of CDSs longer than 2000 bp was 1,563, which was approximately 18.21% of the CDSs longer than 1000 bp, indicating a relatively higher ratio of CDSs with larger size. The CDSs predicted by ESTScan tool were primarily 200 bp to 1000 bp in length.

Functional annotation and candidate genes involved in CPP biosynthesis
To understand the biosynthesis of CPPs and to identify the genes involved, the transcripts related to starch and sucrose metabolism (ko00500) as well as amino sugar and nucleotide sugar metabolism (ko00520) were analyzed [41]. Intensive studies on cell wall polysaccharides have indicated that this fraction is variable in glycan structures and is mainly composed of xyloglucans, glucomannans, and pectic polymers. Across known cases the plant polysaccharides are constitutive of nucleotide-diphospho-sugar (NDP-sugar) precursors. A class of enzymes called glycosyltransferases (GTs) catalyzes specific sugar incorporation from an NDP-sugar into a growing polysaccharide chain. The composition of cell walls is determined in part by the availability of NDP-sugars to form different types of wall polymers [42][43][44]. Since all wall polysaccharides are synthesized from activated NDP-sugars, biosynthetic pathway of CPPs was inferred and predicted.

Starch and sucrose metabolism
Among 128 metabolic pathways detected in C. pilosula, candidate genes related to starch and sucrose metabolic pathways were ranked seventh as far as the number of involved genes was concerned (S3 Table). To understand the biosynthesis of CPPs, the unigenes involved in the biosynthesis of starch and other related polymers as well as those related to carbon assimilation in the leaf tissues were annotated. The key enzymes involved in these metabolic processes were identified and annotated based on KEGG (see Table 5 and S3 Table). In plants, UGPase catalyzes the formation of uridine diphosphate glucose (UDP-Glu), a key precursor of NDP-sugars formed from 1-P-glucose (Glu-1-P) [45,46]. In Astragalus membranaceus, the enzyme activity is positively correlated with the concentration of Astragalus polysaccharides, indicating that UGPase plays an important role in polysaccharide biosynthesis [16]. In Arabidopsis thaliana, UGPase is a rate-limiting factor and contributes primarily to UDP-Glu metabolism in the vegetative phase [45]. In C. pilosula library, Unigene16117_Codonopsis (KJ470627) was identified as a homologue of UGPase. This 2087-bp unigene contained a complete open reading frame (ORF) and showed the highest amino acid identity (88%) to UGPase from V. vinifera (Accession number: XP_002282276)(S2 Fig.). In A. membranaceus, granulebound starch synthase (GBSS) is a key enzyme required for the synthesis of polysaccharides [17]. A 2308-bp Unigene7794_Codonopsis (KJ470628) was also identified as a homologue of GBSS and contained an intact ORF encoding 622 amino acids.
Thirty-seven transcripts encoding NSEs were identified in C. pilosula (Table 6, S4 Table). A comparison of FPKM values revealed that in the transcriptome library, the transcript of UGl-cAE (UDP-glucuronate 4-epimerase) was the most abundant, followed by those of UGDH (UDP-glucose 6-dehydrogenase) and AXS (UDP-apiose/xylose synthase). The protein encoded by UGlcAE is responsible for the production of UDP-GalA, using UDP-GluA as a precursor. The enzyme encoded by UGDH converts UDP-Glu into UDP-GluA, thus taking part in the biosynthesis of glycosaminoglycans. AXS converts UDP-GluA to UDP-Xyl. Relatively higher abundance of such transcripts suggested that the biological process of conversion from UDP-GluA to UDP-Xyl should be very active.

CPP biosynthetic pathway
Based on the annotation results of the abovementioned pathway and documents published [41], biosynthetic pathway for the formation of CPPs from sucrose was cautiously constructed and was shown in Fig. 8. In this scheme, synthesis of CPPs is divided into three main steps. The first step is the formation of UDP-Glu and GDP-Man from Glu-6-P and Fru-6-P, which are then converted in the second step into other NDP-sugars. Finally, the active monosaccharide units, NDP-sugars, are added to the sugar residues of various polysaccharides and glycoconjugates by the action of various GTs [48]. In C. pilosula library, 723 transcripts encoding GTs were identified based on the annotations in Arabidopsis thaliana, including 591 GTs and 132 GTNCs (referenced for glycosyltransferases with no classification) (see S5 Table). Among these GTs, the transcripts of family 1 glycosyltransferase (GT1s), often referred to as UDP glycosyltransferase (UGTs), were the most abundant, indicating that GT1s might be encoded by comparably large multigene families. The transcripts of GT2, GT8, GT4, and GT48 were more abundant than that of others, and GT31, GT20, GT41, GT5, and GT90 showed moderate abundance, whereas the remaining had only limited number of transcripts (less than ten in number). An imbalance in the number of transcripts was also observed in annotated UGTs. Since these genes have specific functions in the biosynthesis of CPPs, the differing abundances are indicative of a preference for the synthesis of specific NDP-sugars.
In addition to polysaccharides, UGTs are also responsible for the transfer of sugar moieties from UDP sugars to other plant natural products, such as phenolic compounds, cyanins, steroids, flavonoids, and terpenes [49]. In general, approximately 48% of UGTs contain a common motif known as plant secondary product glycosyltransferase (PSPG) box [50]. As key sugar donors, NDP-sugars are primarily attached to PSPG motif of UGTs. Therefore, the PSPG motif is very important for UGTs. This motif contains 44 conserved amino acids. The amino acids at C-terminus are glutamine (Q) and histidine (H), which are highly conserved in glycosyltransferase and galactosyltransferase respectively [50]. An analysis of the amino acid sequences of PSPG in the annotated UGTs of C. pilosula showed that the terminal amino acid residue of all annotated UGTs was Q except in unigene2366, whose terminal amino acid residue was H, indicating an obvious imbalance in the transfer of NDP-sugars.
Amino acid sequences after multiple sequence alignments reflect their PSPG structures and their functions. Therefore, amino acid sequence analysis of this motif will provide insights into their functional diversities in CPPs biosynthesis. Conservative analysis on the amino acid sequences of fifty UGTs possessing a typical PSPG sequence was performed using WebLogo (http://weblogo.berkeley.edu/logo.cgi) and is presented in Fig. 9. The font size represented the conservative properties of amino acid residues, the bigger the font the more conservative amino acid residue. It indicates that the first and twenty-second tryptophan (W), fourth and forty-fourth glutamine (Q), tenth and nineteenth histidine (H), twenty-first glycine (G), twenty-fourth serine (S), twenty-seventh glutamic acid (E), and thirty-fourth and thirty-ninth proline (P) residues at N-terminus are more conserved than other amino acids in the motif. Studies on the structure-activity relationship of UGT have demonstrated that the nucleotide sugar donor primarily interacts with the C-terminal domains and that the acceptor mainly binds to the N-terminal domains [50]. Since diverse natural products act as acceptors of UGTs, further studies are needed to understand whether specific UGTs are involved in CPPs biosynthesis.

Real-time PCR analysis
Analysis of the abundance of transcripts of enzymes related to CPP biosynthesis provides evidence at two levels. On one hand, the detection of PCR products verifies if the sequences obtained by high throughput screening are reliable and accurate. On the other hand, it provides insights into their functions in different branches of CPPs biosynthesis. Based on the preliminary results of the selection of reference genes in C. pilosula (see S6 Table), GAPDH was used as internal reference in the present study. Nine candidate unigenes were selected based on the alignment results from online BLAST for verification. Among these candidates, manA (Uni-gene186) had the lowest identity to its homolog in the dataset (66% to 67%). Relatively lower identity of UXE (Unigene 9227) to its homolog was also obtained, but the value was larger than 79% (V. vinifera, XP_002264946). For following candidates: manB (Unigene11590), UGPase (Unigene 16117), RHM (Unigene 7955), UER (Unigene 12223), and UGlcAE (Unigene 15345), their identities to the corresponding homolog ranged from 81% to 93%. UGDH (CL5269.Contig 2) and AXS (Unigene 14947) shared rather higher identities (equal or larger than 90%) to their homologs available in the public dataset. The detailed results of the alignment were listed in the S7 Table. The analysis primarily confirmed that the selected potential unigenes are rather conservative and should be responsible for the synthesis of CPPs. Among these candidate unigenes, the ones encoding UXE (UDP-arabinose 4-epimerase), UER (3,5-epimerase/4-reductase), manA (mannose-6-phosphate isomerase), and manB (phosphomannomutase) should function downstream of the predicted pathway of CPPs biosynthesis (see Fig. 8), whereas the ones encoding UGDH (UDP-glucose 6-dehydrogenase), RHM (UDP-glucose 4,6-dehydratase), and UGPase should act upstream. The predicted pathway also indicated that the unigenes responsible for UGlcAE (UDP-glucuronate 4-epimerase) and AXS (UDP-apiose/xylose synthase) should possess the same precursor and work at two branch points. Real-time PCR was carried out using equally mixed total RNA extracted from all tissues collected at flowering and boll-forming stage as a template for transcription, and PCR products were detected by 1.5% agarose gel electrophoresis. The result was presented in S3 Fig. All unigenes investigated were successfully amplified, confirming the reliability and high accuracy of the used deep sequencing technique.
The results of the relative expression of the nine candidate genes in the four tissues are presented in Fig. 10. Among these, the expression of ManA was relatively stable and there was no significant difference among the tissues. A similar expression pattern of UGPase was also observed. Relatively lower expressions in leaves and relatively higher expressions in stem and root tissues were observed for manB, RHM, UER, UGDH, UGlcAE, and AXS, indicating that the expression of these candidate genes were tissue-dependent. Comparatively, the expression of UXE in roots was much higher than the expression in other tissues, indicating an important role the gene might play in the biosynthesis of CPPs in the species.
Besides real-time PCR, another dataset on the abundance of the transcripts of these unigenes was obtained by RNA-seq (S8 Table). A comparison between the two datasets was performed and was presented in Fig. 11. In the figure, the data on relative abundance of the transcripts obtained by real-time PCR were presented as a column graph, while the FPKM value of each transcript was indicated with a broken line. It revealed that there was some disagreement in the two datasets of each unigene even in the same tissue excluding UGPase. For example, the data from the two datasets on manA in roots, and RHM and UGlcAE in stems were not in accordance, indicating bias in the datasets obtained by the two methods. In general, similar patterns of gene expression in the same tissue and at the same growth stage should be observed by both techniques. In the present work, two sets of samples were collected from plants at the same growth stage but not in the same year (2012 and 2013). Inconsistent expression patterns might be caused by different environmental conditions or stress factors. The comparison between the two datasets also demonstrated that the expression of UGPase is relatively stable, suggesting conservation and vital role of the enzyme in the process of CPPs biosynthesis. Due to the higher sensitivity of Illumina sequencing compared to real-time PCR, further investigation is needed to understand their functions at a molecular level [51].

Discussion
To our knowledge, this is the first de novo sequencing of samples from C. pilosula using Illumina HiSeq 2000 platform. The sequencing produced 91,175,044 clean reads and 45,511 unigenes after assembly and made it possible to understand the biosynthesis of CPPs at molecular level. Compared to datasets published on medicinal plants, the number of unigenes obtained from C. pilosula was lower than that obtained from Polygonum cuspidatum (86,418), Dimocarpus longan (68,925), and Hypericum perforatum (59,184), but was higher than that from Siraitia grosvenorii (43,891) [21,24,52,53]. However, the average length of unigenes obtained here was 728 bp, which was longer than that from other medicinal plants, including P. cuspidatum  (422 bp). A relatively higher annotation rate (76.1%) was also observed in the present work. Notably, the number of unigenes longer than 1000-bp was 8,584, accounting for 23.75% of all unigenes, indicating higher quality of the transcriptome library and the reliability of Illumina Hiseq-2000 technology for global screening of functional genes in non-model plants, especially medicinal plants.
Using bioinformatics tools, we identified unigenes involved in CPPs biosynthesis. Annotation results suggested that sucrose is used as the primary starting material for CPPs biosynthesis. Further analysis showed relatively higher abundance of CPPs biosynthesis-related transcripts in the root and stem tissues, which might be related to the functions of the studied tissues and might be beneficial to the synthesis and accumulation of polysaccharide, the main active component in the plant. The real-time PCR analysis showed that the transcripts of genes related to the conversion of UDP-glucose to UDP-L-arabinose (branch pathway) was more abundant than those of the others, suggesting that this pathway functions as a key branch of CPPs biosynthesis at the flowering and boll-forming stage of C. pilosula. However, a previous study on C. pilosula collected during traditional harvest season showed that the content of arabinose was not the highest and was substantially lower than that of glucose and galacturonic acid [12]. NSEs are responsible for the derivatization of most NDP-sugars. The abundance of individual NSEs is associated with the activity of the enzyme and the terminal product. Consistent with a previous study, which found that xylose was an important unit in CPPs biosynthesis and that its molar ratio (15.41) was higher than that of galactose (2.87), glucose (1.92), and rhamnose (12) [54], our results showed higher abundance of AXS transcript in the C. pilosula library. Further studies are needed to identify the diverse molecular entities used for the biosynthesis of CPPs.
UGPase is a key enzyme that functions upstream of the biosynthesis pathway of Astragalus polysaccharides, and is responsible for the conversion of glucose-1-phosphate to UDP-Glu. Although there was a positive correlation between the activity of UGPase and the polysaccharide content in A. membranaceus, higher abundance of the corresponding transcript was not observed in our study [16]. Alignment on UGPase homolog from representative species in plant kingdom showed that Unigene16117 was more similar to UGPase from V. vinifera (Accession number: XM_002282240) than to UGPase from A. membranaceus (AF281081) both at nucleotide and amino acid levels. Phylogenetic analysis showed that Unigene16117 and UGPase from Vitis vinifera are clustered in the same subclass and that the homologues from C. pilosula and A. membranaceus are located relatively far from each other (see S2 Fig.). Therefore, the low abundance of Unigene16117 may be related to the relatively low content of CPPs in the raw materials investigated or may have been caused by divergent function of the protein. However, consistent abundance pattern of the transcript in all tissues examined across two growing seasons confirm the conservation and importance of the enzyme in the biosynthesis of CPPs. Further studies are needed to understand the relationships between enzyme activity, transcript abundance, and the content of CPPs in C. pilosula.
UDP-glucose dehydrogenase (UGDH) catalyzes the oxidation of UDP-glucose (UDP-Glc) to UDP-glucuronate (UDP-GlcA), a key sugar nucleotide involved in the biosynthesis of plant cell wall polysaccharides. Analysis on endogenous expression of a UGDH in Eucalyptus grandis (GenBank accession number: EF179384) showed that it was mainly expressed in roots, stem and bark of 6-month-old saplings, with lower expression level in leaves [55]. In C. pilosula, a relatively higher expression of UGDH-like gene was observed in stem and root tissues, demonstrating similar expression pattern to its homologue from E. grandis. Both works carried out in two distantly related plants also indicated the conservation of UGDH homology's function in plant kingdom. L-Rhamnose (Rha) is an important constituent of CPPs, which is synthesized by the enzyme encoded by AtRHM1 in Arabdopsis [56]. Expression studies with AtRHM1 promoter-GUS fusion gene showed that AtRHM1 is expressed almost ubiquitously, with stronger expression in roots and cotyledons of young seedlings and inflorescences. The most abundant transcript of RAH1-like unigene was detected in roots, which was consistent with AtRHM1 in Arabidopsis. Thus, the studies on UGDH and RHM in Arabidopsis confirmed that their homology in C. pilosula should act in a similar way in the biosynthesis of CPPs, and that they should be functional genes involved in the biosynthesis of CPPs. Although UGPase, UGDH, and RHM function in parallel in separated branches of biosynthesis pathway of CPPs, the expression patterns of the latter two were more similar to each other, possessing relatively higher expression in stems and roots, indicating that the transcription of them might be regulated in a similar molecular mechanism. More work should be needed to understand the mechanism and regulation of biosynthesis and accumulation of CPPs at molecular level.
To summarize, an EST collection was created from a transcriptome library prepared from different parts of C. pilosula. The results demonstrated the usefulness of this methodology for the identification of metabolic pathways related to polysaccharide biosynthesis in medicinal plants. Our results provide understanding of the biosynthesis of CPPs at the molecular level in C. pilosula inferred by real-time PCR analysis of candidate unigenes functioning at key separate branches of biosynthesis pathway of CPPs. These findings will allow for an efficient and sustainable production of CPPs and related bioactive natural products.