De Novo Transcriptome Assembly and Annotation of the Leaves and Callus of Cyclocarya Paliurus (Bata1) Iljinskaja

Cyclocarya Paliurus (Bata1) Iljinskaja contains various bioactive secondary metabolites especially in leaves, such as triterpenes, flavonoids, polysaccharides and alkaloids, and its leaves are widely used as an hyperglycemic tea in China. In the present paper, we sequenced the transcriptome of the leaves and callus of Cyclocarya Paliurus using Illumina Hiseq 4000 platform. After sequencing and de novo assembly, a total of 65,654 unigenes were generated with an N50 length of 1,244bp. Among them, 35,041 (53.37%) unigenes were annotated in NCBI Non-Redundant database, 19,453 (29.63%) unigenes were classified into Gene Ontology (GO) database, and 7,259 (11.06%) unigenes were assigned to Clusters of Orthologous Group (COG) categories. Furthermore, 11,697 (17.81%) unigenes were mapped onto 335 pathways in Kyoto Encyclopedia of Genes and Genomes (KEGG), among which 1,312 unigenes were identified to be involved in biosynthesis of secondary metabolites. In addition, a total of 11,247 putative simple sequence repeats (SSRs) were detected. This transcriptome dataset provides a comprehensive sequence resource for gene expression profiling, genetic diversity, evolution and further molecular genetics research on Cyclocarya Paliurus.


Introduction
Cyclocarya Paliurus (Bata1) Iljinskaja, a unique genus of Juglangdaceae, is a well-known edible and medicinal plant growing in southern China. The leaves of Cyclocarya Paliurus have been often used to produce teas with health benefits, which is named "sweat tea" for its slight sweetness [1][2][3][4]. It has been demonstrated that C. Paliurus exhibits various pharmacological activities such as anti-hypertensive, hypoglycemic, antioxidant and enhancement of mental efficiency [5][6][7], which may be mostly attributed to its various bioactive components, e.g. flavonoids [4], triterpenoids [8], polysaccharides [9] and polyphenols [10]. Among these compounds, triterpenoids are an important group of health-promoting chemicals. In our previous 6-Furfurylamino-purine (KT 1.0 mg/L) and cultured under a 12/12 h (light/dark) photoperiod. The subculture interval was initially 20 days, then gradually decreased to 10 days with the increase of subculture time [32]. The collected samples were immediately frozen in liquid nitrogen and stored at -80°C. Total RNA of each sample were extracted and purified using TRIzol 1 reagent (Plant RNA purification reagent, Invitrogen, Carlsbad, CA, USA) according the manufacturer's instructions. The RNA concentration and purity were detected by Nanodrop 2000 (Thermo Fisher, America), and the quality of RNA was further verified by gel electrophoresis and Agilent 2100. Only high-quality RNA samples (OD 260/280 = 1.8~2.2, OD260/2302.0, RIN6.5, 28S:18S1.0, >10μg) were used to construct sequencing library.
cDNA library construction and Illumina sequencing RNA-seq transcriptome library was prepared from 5μg of total RNA using TruSeq TM RNA sample preparation kit from Illumina (San Diego, CA). The poly(A) mRNA was isolated from total RNA using Oligo (dT) magnetic beads. Following purification, the mRNA was randomly cleaved into short fragments (100 to 400 bp) after adding fragmentation buffer. These short fragments were used as templates to synthesize the first-strand cDNA using reverse transcriptase and random primers. The second-strand was synthesized subsequently using a SuperScript double-stranded cDNA synthesis kit (Invitrogen, CA) with random hexamer primers (Illumina). The cDNA fragments were purified and resolved with EB buffer for end repair and Atailing addition, and then connected with paired-end adapters. After PCR amplification using Phusion DNA polymerase (NEB) for 15 PCR cycles, the cDNA target fragments of 200-300 bp were size-selected to establish the cDNA library on 2% Low Range Ultra Agarose. After quantification by TBS380, the paired-end RNA-seq sequencing library was sequenced from the 5' to 3' ends using an Illumina Hiseq 4000 platform with the 2×151 bp paired-end read module.

Data filtering and De novo assembly
The raw paired-end reads, which were transformed by the Base Calling into sequence data, were cleaned to high-quality reads by removing the joint sequences and adaptor sequences, reads containing more than 10% N rate, and trimming the low-quality reads (quality value < 20) from the 3' end of the sequence and the raw reads with an average length less than 30bp. Then, de novo transcriptome assembly was conducted using software Trinity (http:// trinityrnaseq.sourceforge.net/, Version number: trinityrnaseq, release-20140413) without reference genome [33], Trinity consists of three software modules: Inchworm, Chrysalis and Butterfly, and has been regarded as the authoritative software for the efficient and robust de novo reconstruction of transcriptome.
De novo assembly was carried out according to the established method [33], which was briefly described as follows. Firstly, linear transcript contigs were efficiently reconstructed by Inchworm in the following seven steps: (1) Constructing a k-mer dictionary (k = 25) from all sequence reads; (2) Removing likely error-containing k-mers from the dictionary; (3) Selecting the most frequent k-mer to seed a contig assembly (excluding both low-complexity and singleton k-mers); (4) Extending the seed in each direction with the highest occurring k-mer of a k-1 overlap; (5) Extending the sequence in either direction until it cannot be extended further; (6) reporting the linear contig; (7) Repeating steps three to six with the next most abundant k-mer until the entire k-mer dictionary has been exhausted. Secondly, Inchworm contigs were recursively grouped into connected components by Chrysalis. Contigs with a perfect overlap of k-1 bases or a minimal number of reads that span the junction across both contigs were deemed to be derived from the same gene and clustered into the same group. After grouping, complete de Bruijn graphs were constructed for each component. Thirdly, Butterfly processed the individual graphs independently, and extracted full-length isoforms and teased apart transcripts derived from paralogous genes. Redundant sequences were eliminated, and the longest transcript that could not be extended on either end was defined as unigenes. The assembled unigenes (longer than 200 bp) had been deposited into the NCBI Transcriptome Shotgun Assembly Sequence Database (http://www.ncbi.nlm.nih.gov/genbank/tsa/) with the accession numbers (GEUI00000000).

Functional annotation and classification
All assembled unigenes were aligned with BLASTX program for homology searches against publicly available protein databases Non-redundant (http://www.ncbi.nlm.nih.gov/), Swissprot (http://www.ebi.ac.uk/uniprot/), Pfam (http://pfam.sanger.ac.uk/), String(http://string-db.org/) with identity set at >30% and a cutoff E-value of 10 −5 , and annotated and classified on Gene Ontology (http://www.geneontology.org/), Clusters of Orthologous Group (http://www.ncbi. nlm.nih.gov/COG/), and the KEGG pathway (http://www.genome.jp/kegg/) with a threshold E-value of 10 −5 . The aligning results were used to identify the sequence direction and to predict the coding regions. If the aligning results from different databases were conflicted with each other, a priority order of alignments from Nr, SwissPort, KEGG, GO and COG was followed. Based on the Nr annotations, the Blast2GO program was used to obtain GO annotations according to biological process, molecular function and cellular component [34]. The unigenes were also aligned to the COG database to predict and classify functions, and the secondary metabolic pathways were annotated according to the KEGG pathway database. Transcription Factors of C. Paliurus were extracted from Plant Transcription Factor Database (PlantTFDB), and unigenes were mapped to them using Blastn program.

Detection of SSR markers
The unigenes were scanned for microsatellites using the MISA software (http://pgrc. ipkgatersleben.de/misa/) with the default parameters. The parameters were adjusted for identification of perfect di-nucleotide, tri-nucleotide, tetra-nucleotide, penta-nucleotide, and hexanucleotide motifs with a minimum of 6, 5, 5, 5, and 5 repeats, respectively. Primer pairs were designed using Primer 3.0.

RNA sequencing and de novo assembly
To generate a comprehensive overview of C. Paliurus transcriptome, total RNA were extracted from leaves and callus, then the mRNA was isolated, and cDNA libraries were established and sequenced separately using Illumina Hiseq 4000 platform, which generated 39.0 and 49.4 million raw reads, respectively (Table 1). After removing adaptor sequences, ambiguous reads and low-quality reads, the quality of reads was assessed successively. The clean reads were individually generated with the average GC percentage of 46.49% and 47.90% (Table 1). By using the Trinity program [1], all high-quality reads were assembled into 65,654 unigenes with an N50 of 1,244 bp and average length of 704 bp, and 84,223 transcripts were constructed with an N50 of 1,362 bp and average length of 792 bp ( Table 2). The average GC content of the unigenes was 42.95%. Furthermore, the length of these unigenes ranged from 201 to 10,000 bp (Fig 1), and the majority were disturbed in 201-400bp. However, there are still 14,155 unigenes (21.56%) whose lengths were more than 1,000bp. These data indicated that the generated unigenes in our experiments were of fine quality and therefore suitable for further annotation.

Metabolic pathway analysis by KEGG
The KEGG pathway database provides a wealth of information on molecular interaction and reaction networks t o further understand the biological functions of unigenes. The mapped results indicated that 11,697 (17.81%) unigenes were predominantly annotated with Enzyme Commission (EC) numbers and divided into five branches according to the metabolic pathway ( Fig 5) (Metabolism; Genetic Information Processing; Environmental Information Processing; Cellular Processes; Organismal Systems), and further grouped into 335 KEGG pathways (S2 Table). It was noteworthy that 7,746 (66.34%) mapped unigenes participated in the metabolism and 1,312 (11.24%) were involved in the biosynthesis of secondary metabolites such as phenylpropanoid biosynthesis (197), flavonoid biosynthesis (42), terpenoid backbone biosynthesis (58), which were the important information we are especially interested in. Furthermore, the top 20 largest annotated pathway groups of C. Paliurus were presented in Fig 6. The most representative KEGG pathway was "Ribosome", followed by "Plant hormone signal transduction" (328), "Protein processing in endoplasmic reticulum" (301), "RNA transport" (299), and "Plant pathogen interaction" (280). As shown in Fig 6, there were 328 (2.81%) unigenes mapped into the "Plant hormone signal transduction" pathway, some of which might be stimulated to express by the supplemented plant hormones in the callus culture medium. In our experiments, plant hormones such as rootone, indole acetic acid, and cytokinins were added into the callus culture medium, which have been presumed to participate in the regulation of "Plant growth", "Cell culture", "Cell differentiation", and "Recession". These predicted results indicated that numerous unigenes were involved in the secondary metabolite biosynthesis of C. Paliurus leaves and callus. The above mapped information with KEGG pathway database would be very useful for future researches on gene function and its regulatory mechanism of C. Paliurus.

Transcription factor analysis
Transcription factors (TFs) play critical roles in plant growth, bioactive component synthesis and gene expression regulation, especially in the secondary metabolism regulation. Plants show various TFs expression patterns when growing in different environments or facing stress, which further significantly affect the synthesis of secondary metabolites [38][39][40]. Therefore, the identification of putative TF genes is useful for understanding the regulatory mechanism of secondary metabolites. TFs are often classified into different families according to the features of DNA-binding domains. In this study, a total of 21,843 unigenes were annotated and further classified into 60 transcription factor families (Plant Transcription Factor Database, PlantTFDB, http://planttfdb.cbi.pku.edu.cn) in this paper (Fig 7). Among these TF families, the most abundant transcription factors of C. Paliurus were found in the bHLH family which includes 2,120 unigenes. The second was NAC, followed by the bZIP, MYB-related, WRKY, C3H, B3 and C2H2 TF families, which contains 1,734, 1,422, 1,311, 1,071, 1,036, 1,007 and 990 unigenes, respectively. Researches have validated that the bHLH, WRKY, MYB, bZIP, and C2H2 TF families play a major role in the regulation of many genes which participate in the plant secondary metabolism [41,42], especially for the regulation of the bioactive component synthesis, such as flavonoids [43], alkaloids [44,45], and terpenoids [46]. The expression level of these TFs in C. Paliurus may be associated with the biosynthesis of secondary metabolites, and the discovery of these putative TFs may provide valuable information for the future researches on gene expression regulation, particularly those TFs related to the flavonoid pathway which will be described below.

Analysis of flavonoid biosynthesis pathway
Flavonoids are polyphenolic secondary metabolites derived from the phenylalanine via the phenylpropanoid pathway, which become a research focus in recent years for their various biological and pharmacological activities [47]. It was reported that the total flavonoid content of C. Paliurus leaves ranged from 0.73 to 4.73%, depending on the growing region, harvest time, determination method and so on [48][49][50]. The most abundant flavonoid was isoquercitrin in the C. Paliurus leaves [12]. Up to now, sixteen flavonoids have been isolated and identified from C. Paliurus (Table 4). In the present study, 197 unigenes were found to be involved in phenylpropanoid pathway by mapping with KEGG pathway database. Among them, 42, 3, 1 and 3 unigenes were involved in the biosynthesis of flavonoids, anthocyanin, isoflavonid and flavone and flavonol, respectively, which represented different enzymes in the different pathways. The unigenes associated with flavonoid biosynthetic pathway were shown in the Fig 8. A total of 10 candidate genes with annotations matching enzymes in the flavonoid biosynthesis, i.e., phenylalanine ammonialyase (PAL), cinnamate-4-hydroxylase (C4H), 4-coumaroyl: Moreover, the unique putative unigenes encoding anthocyanidin synthase (ANS), leucoanthocyanidin reductase (LAR), anthocyanidin reductase (ANR) and anthocyanidin 3-O-glucosyltransferase (3GT) were also prominently found. It's important to clone these unique genes and further analyze their functions in the future studies. Isoflavonoids, a subclass of flavonoids with special structure and function, have been normally found in leguminous plants. It's interesting that one isoflavonoid 7-hydroxyl-4´-methoxy isoflavone was identified in C. Paliurus leaves [51]. However, 2-hydroxyisoflavanone synthase (IFS), which catalyzes the conversion of flavanones to isoflavones, hasn't been found in our mapping experiments. A further study is needed in the future. Transcription factors play an important role in flavonoid biosynthesis. These factors regulate some genes co-expression to stimulate or inhibit the accumulation of flavonoids when combining with the special structure gene function. So far, researchers have isolated and MYC (bHLH) is one of the biggest transcription factor families in plant. This family regulates not only the plant growth and developmental processes including formation of trichome and light signal transduction, but also the stress responses and secondary metabolism [56,57]. Plant bHLH proteins have been classified into 32 subfamilies according to genome-wild classification and the evolutionary analysis [58], whereas members of the same plant bHLH subfamilies had similar functional characteristic [59]. There were some bHLH subfamilies involved in the regulation of phenylpropanoid and terpenoids biosynthesis pathways, such as AtTT8 in Arabidopsis [60], OsRa-c in rice [61] and TcJAMYC in yew [46]. Furthermore, the above mentioned subfamilies have also been found to be related to the regulation of anthocyanin biosynthesis in various plants, for example, Zea mays L, Oryza sativa L, Dahlia variabilis Hort.and Brassica oleracea L [62][63][64][65].
It was reported that there were approximately 197 and 155 MYB genes in Arabidopsis and rice, respectively [66]. The first plant MYB gene, C1, encodes a MYB-like TF, was isolated from Zea mays, which regulated the synthesis of anthocyanin with the synergy of R/B proten family   [68]. According to Grotewold et al [69], maize P1 (R2R3-MYB transcription factor) improved the expression of CHS (chalcone synthase), CHI (chalcone isomerase), DFR(dihydroflavonol 4-reductase), but didn't stimulate the expression of branching enzyme such as F3H (flavanone 3-hydroxylase) in anthocyanin biosynthesis. However, it has been reported that many R2R3-MYB transcription factors, e.g., AN2 in Petunia hybrid [70] and PAP1/PAP2 in Arabidopsis thaliana [53], were involved in the regulation of anthocyanin biosynthesis. In our analysis, a total of 711 MYB genes were found in C. Paliurus. Therefore, more study remains to be done to identify the MYB genes from C. Paliurusfor studying the regulation of flavonoid biosynthesis.
In the present paper, we found many transcription factors involved in flavonoid metabolism of C. Paliurus, providing critical information for further regulation study on the gene expression and secondary metabolism. We have established the cell suspension culture of C. Paliurus to produce the bioactive components such as flavonoids, triterpenoids and polysaccharides, and the above mentioned transcription factor information will be helpful to further develop and improve our current approach to achieve a higher yield of these compounds by gene expression regulation. Therefore, we can use genetic engineering for improving plant flavonoid secondary metabolic pathways to effectively increase the content of secondary metabolites by studying the transcription factors involved in flavonoid secondary metabolism, and understanding the mechanisms of plant secondary metabolic regulation.

Detection of simple sequence repeats (SSRs)
SSRs, or microsatellites are important molecular markers for genetics and biology researches, including gene mapping, genetic diversity assessment, comparative genomics, and molecular breeding. In this paper, 65,654 assembled unigene sequences from C. Paliurus were scanned to explore the SSR profiles using MISA software and the results were shown in Table 5. A total of 9,688 sequences containing 11,247 SSRs were identified. Of all 9,688 SSR containing sequences, 1,353 had more than one SSR. In addition, 494 SSRs were present in compound forms. Among these SSRs, dinucleotide repeat motifs (5,198,46.21%) were the most abundant, followed by AAG/CTT, AC/GT and ACC/GGT repeat (Table 6). A total of 5341 primer pairs, which contain three sets of primers respectively, were designed from 9,688 sequences using Primer 3 (S3 Table). These results provided plenty of reliable markers for genetic linkage mapping, analysis of genetic polymorphism and functional gene mining of C. Paliurus and its closely related species.

Conclusions
C. Paliurus is a well-known edible and medicinal plant, but no genomic information is available yet. In this paper, the transcriptome of C. Paliurus leaves and callus without a reference genome was analyzed using Illumina Hiseq 4000 platform. A total of 65,654 assembled unigenes were generated. The annotated unigenes were functionally classified in the GO, COG, and KEGG databases. Moreover, the putative simple sequence repeats (SSRs) were detected. To our knowledge, this is the first attempt to de novo assemble the whole transcriptome of C. Paliurus. This study provided not only a comprehensive enough coverage for gene cloning, expression, and functional analysis, but also a valuable public platform to understand the biosynthesis and regulation of secondary metabolites in C. Paliurus, especially those important bioactive components.
Supporting Information S1  Transcriptome Assembly and Annotation of Cyclocarya Paliurus