De novo transcriptomic analysis of leaf and fruit tissue of Cornus officinalis using Illumina platform

Cornus officinalis is one of the most widely used medicinal plants in China and other East Asian countries to cure diseases such as liver, kidney, cardiovascular diseases and frequent urination for thousands of years. It is a Level 3 protected species, and is one of the 42 national key protected wild species of animals and plants in China. However, the genetics and molecular biology of C. officinalis are poorly understood, which has hindered research on the molecular mechanism of its metabolism and utilization. Hence, enriching its genomic data and information is very important. In recent years, the fast-growing technology of next generation sequencing has provided an effective path to gain genomic information from nonmodel species. This study is the first to explore the leaf and fruit tissue transcriptome of C. officinalis using the Illumina HiSeq 4000 platform. A total of 57,954,134 and 60,971,652 clean reads from leaf and fruit were acquired, respectively (GenBank number SRP115440). The pooled reads from all two libraries were assembled into 56,392 unigenes with an average length 856 bp. Among these, 41,146 unigenes matched with sequences in the NCBI nonredundant protein database. The Gene Ontology database assigned 24,336 unigenes with biological process (83.26%), cellular components (53.58%), and molecular function (83.93%). In addition, 10,808 unigenes were assigned a KOG functional classification by the KOG database. Searching against the KEGG pathway database indicated that 18,435 unigenes were mapped to 371 KEGG pathways. Moreover, the edgeR database identified 4,585 significant differentially expressed genes (DEGs), of which 1,392 were up-regulated and 3,193 were down-regulated in fruit tissue compared with leaf tissue. Finally, we explored 581 transcription factors with 50 transcription factor gene families. Most DEGs and transcription factors were related to terpene biosynthesis and secondary metabolic regulation. This study not only represented the first de novo transcriptomic analysis of C. officinalis but also provided fundamental information on its genes and biosynthetic pathway. These findings will help us explore the molecular metabolism mechanism of terpene biosynthesis in C. officinalis.

Introduction Henan Province on 20 October 2016. All samples were immediately frozen in liquid nitrogen and stored at -80˚C until RNA extraction.

RNA extraction and cDNA library preparation
Total RNA from approximately 80 mg of frozen tissue of leaves (coYP) and fruits (coGS) was extracted using the TRIzol 1 Reagent (Invitrogen, USA) according to manufacturer's protocol. RNA quality was assessed by NanoDrop™2000 spectrophotometer (NanoDrop Technologies, USA). All RNA extracts showed a 260/280 nm ratio of 1.8 to 2.2. Approximately 1 μg total RNA of no less than 50 ng/μL concentration, was used for RNA-seq library construction. The cDNA library was constructed using Truseq™ RNA sample preparation Kit (Illumina, San Diego, CA, USA) following the manufacturer's protocol. Poly-(A) mRNA was isolated from total RNA through Oligo-(dT) magnetic beads, and then fragmented in fragmentation buffer. The mRNA was randomly cut into 200 bp segments. The first strand cDNA was synthesized using short fragments, whereas the second strand cDNA was synthesized by second strand synthesis mix. The cDNA was repaired with End Repair Mix and addition of "A" base adaptors were ligated to the cDNA molecules.

RNA Illumina Hiseq sequencing
The cDNA library was enriched using 15 cycles of PCR. Subsequently, the target band was recycled using 2% low range ultra-agarose, and quantified by TBS 380 Mini-Fluorometer. The suitable libraries were sequenced using Illumina HiSeq 4000 SBS Kit (300 cycles, Illumina, San Diego, CA, USA) at Shanghai Majorbio Bio-pharm Biotechnology Co., Ltd. (Shanghai, China). The raw sequence reads from the Illumina sequencing were deposited in the NCBI Sequence Read Archive (SRA).

De novo assembly
First, raw reads with adaptor were trimmed, vector contaminated and low quality reads (base Q value < 20) were discarded. Second, whole reads with a base Q value < 10 were discarded. Lastly, reads with "N" bases and lengths below 20 were removed. All the above mentioned processes were carried out using the SeqPrep (https://github.com/jstjohn/SeqPrep) and Sickle (https://github.com/najoshi/sickle) [28]. Because of the absence of a reference genome, the high quality reads of leaves and fruits of C. officinalis were used for de novo assembly using the Trinity (http://trinityrnaseq.sourceforge.net/) [28].

Expression analysis
After assembly, high quality RNA-seq reads were used to determine alignment counts and quantify transcript abundance using the RSEM package (RNA-seq by Expectation Maximizattion, http://www.biomedsearch.com/nih/RSEM-accurate-transcript-quantification-from/ 21816040.html) using minimum and maximum fragment lengths of 200 and 300bp, respectively [35]. RSEM can be used to calculate the number of RNA reads or fragments mapped to unigenes based on FPKM (fragments Per Kilobase per Million) values [36].
DEGs between leaves and fruit tissue of C. officinalis were identified using the R Bioconductor package edgeR (http://www.bioconductor.org/packages/2.12/bioc/html/edgeR.html) was used [37]. The FDR (false discovery rate) is used to determine p-valuethresholds in multiple testing [38,39]. The significance of DEGs were determined based on a threshold of FDR<0.05 and absolute value of log 2 fold change !1 (log 2 fc !1). DEGs were subsequently mapped to the database for pathway enrichment analysis [21].

Illumina sequencing and de novo assembly
The cDNA library was constructed from the fruits and leaves of C. officinalis using TruseqTM RNA sample preparation kit. Subsequently, the high-quality library was sequenced using the Illumina transcriptome platform. Since the genome of C. officinalis is not available, we performed de novo assembly of the transcripts. We obtained 62,026,372 and 59,147,826 raw reads for the fruit (coGS) and leaf (coYP), respectively, which consist of 9,365,982,172 bp and 8,931,321,726 bp. For the quality control of raw reads, we have calculated the composition (S1A and S1B Fig), quality (S1C and S1D Fig), and error rates (S1E and S1F Fig) of A/T /C/G bases in the raw reads. Results showed that the reads meet the quality needed for further analysis. After eliminating random primer and adapter sequences and removing low-quality reads and short sequences of less than 20 bp, we obtained 60,971,652 and 57,954,134 clean reads of fruit coGS and coYP, which contained 9,044,801,270 and 8,579,879,557 bp, respecttively. In addition, the Q20% and Q30% were 98.08 and 97.94 for fruit and 93.97 and 93.58 for leaf, respectively. Moreover, the GC content of coGS and coYP was 45.98% and 46.54%, respectively. The sequences can be found in NCBI SRA with the accession number SRP 115440. For the assembly, a total of 56,392 unigenes consisting of 48,264,743 bp have been acquired. The percentage GC of all the unigenes was 43.74%, and the average length of each unigene was 856 bp. The largest and smallest unigenes were 124,880 and 201 bp in length, respectively. The N50 was 1445 bp (Table 1).

Functional annotation of unigenes
For functional annotation, the unigenes were searched against public databases NR, GO, and KEGG using BlastX, with an E-value cut-off set to 1x10 -5 . Different unigenes have been matched in different databases, for example, 41,146 unigenes were matched in NR database.

Expression profiling of unigenes
To explore the differential expression of unigenes in coGS and coYP of C. officinalis, clean reads from every tissue library were mapped to our unigene database through the RSEM package. Results showed that 50,029,300 reads were aligned in fruits, which was 82.05% of original clean reads (60,971,652), and 48,063,854 reads were aligned in leaves, which was 82.93% of original clean reads (57,954,134). Using the RSEM software, the number of reads or fragments mapped to unigenes and the expression levels of unigenes in fruit and leaf were calculated based on FPKM method. S8 Table demonstrates that different unigenes could be supported by different numbers of reads in fruit and leaf. For example, 958 and 2373 reads were aligned in unigene c70915_g1in coGS and coYP, respectively (S8 Table) In addition, we also determined the FPKM expression quantity of every gene in coGS and coYP (S9 Table), as well as the FPKM expression density scatter gram in coGS and coYP (S2 Fig). Moreover, S2 Fig. indicates all the gene expression density distribution in coGS and coYP. Log2FPKM4-5 was the most concentrated area of gene expression quantity.

Identification and analysis of differentially expressed genes (DEGs)
Using edgeR, we studied the DEGs in coGS and coYP based on FDR, which is a statistical method to test the correction for comparisons. To further explore characteristic upregulated and down-regulated expressed genes in the coYP versus coGS, genes with significant differential expression were determined based on the threshold of FDR< 0.05, log2FC ! 1 and p value < 0.05. According to the standard, 4,585 significant DEGs were identified, with 1,392 being up-regulated and 3,193 being down-regulated ( Table 2, S10  Table). When the significance of differential expression was not considered, 26,136 DEGs were obtained, with 11,735 being up-regulated and 14,401 being down-regulated genes (Table 2, S11 Table).

GO classified statistic and enrichment analysis of DEGs
Using the GO database, the significant DEGs were represented in the three main GO categories, which include biological process, cellular components, and molecular function (Fig 6, S12  Table). In the GO biological process category, with coYP as contrast, the gene expression of metabolic process was obvious. There were 375 up-regulated expression genes and 783 downregulated expression genes. Subsequently, a cellular process with 303 up-regulated and 662 down-regulated expression genes ensued. In the GO molecular function category, with coYP as contrast, gene expression of catalytic activity was significant. There were 326 up-regulated expression genes and 679 down-regulated expression genes. Then, binding with 293 up-regulated and 603 down-regulated expression genes ensued. The detailed classification could be found in Fig 6 and S12 Table. The software GOatools was used to further explore the enrichment of DEGs, significance in enrichment was considered if the p-value< 0.05. Fig 7 shows the GO enrichment of DEGs, with the name and classification of GO plotted on the x-axis and the enrichment ration plotted on the y axis. The color indicates the significance of enrichment, whereby increasing color intensity corresponds to greater enrichment of GO with FDR. There were only 25 GO that were significantly enriched in the coGS in contrast with coYP. For example, the enrichment ratio of photosystem II oxygen evolving complex (15/4585), photosynthesis (29/4585), light harvesting (17/4585), chlorophyll binding (18/4585), and photosystem (26/4585) were the highest with an FDR < 0.001; followed by plastoglobule (12/4585), response to karrikin (17/ 4585) and tetrapyrrole binding (69/4585) with an FDR<0.01. The enrichment ratio of thylakoid lumen (14/4585), protein-chromophore linkage (17/4585), and chloroplast stroma (58/ 4585) were low with an FDR < 0.05. However, the enrichment ratio of photosystem I reaction center, xyloglucosyl transferase activity, and extrinsic component of membrane were the lowest with FDR ! 0.05 (Fig 7, S13 Table).

KEGG enrichment analysis of DEGs
The DEGs between coYP and coGS were subjected to KEGG pathway enrichment analysis using the software KOBAS. Results showed that 1,670 DEGs were mapped to 226 KEGG pathways (S14 Table). The p-value < 0.05 was set as the threshold of significant enrichment. As a result, 434 (25.99%) DEGs were significantly enriched and were associated with 23 pathways (Fig 8, S14 Table). Among the 23 pathways, 10 pathways indicated the most significant enrichment (p-values < 0.001), which includes zeatin biosynthesis (2.07%, 9/

Transcription factor forecast and analysis
Transcription factors are proteins that modulate downstream gene expression at different levels by binding to promoter regions of the gene [44,45]. Furthermore, different transcription factors play a critical role in the regulation of different plant metabolic processes [45]. In this study, 581 transcription factors distributed to 50 transcription factor families were identified in C. officinalis (S16 Table). Among these, basic Helix-Loop-Helix (bHLH) family (41; 7.06%) was found to be the most abundant, followed by MYB (38; 6.54%), ERF (38; 6.54%), GRAS (34; 5.85%) and bZIP (32; 5.51%) (S16 Table).

Discussion
As one of the most commonly and extensively used TCM, Corni Fructus has been used to cure liver and kidney diseases, light-headedness, and pain and weakness in the loin and knees for thousands of years. C. officinalis is a Level 3 protected species, and is one of the 42 national key protected wild species of animals and plants [46]. However, there is little information available with regard to the molecular biology and genomic information of the species, which has impeded the intensive study on its molecular metabolic mechanism and germplasm innovation. In this study, we have performed large-scale transcriptome sequencing of the fruit and leaf tissue of C. officinalis with the use of advanced high throughput Illumina RNA-seq technology, which allowed the transcriptome of C. officinalis to be described for the first time and enriched the gene information of C. officinalis. In addition, the generated transcriptomic data of C. officinalis can help explore the molecular genetic and biochemical characteristics of C. officinalis and its related species. The assembly strategy of Trinity was used in this study as a unified method for transcriptome construction in the absence of a reference genome [28,[47][48][49]. The N50 is one of the most important indices used to assess the assembly quality, wherein longer N50 length corresponds to higher assembly quality. In our study, the N50 length of unigenes was 1445 bp (Table 1), which showed that the sequence assembly was high in quality and is suitable for intensive research. Moreover, the N50 could be compared with other plant species [49][50][51]. In addition, the average length of the unigenes was 856 bp (Table 1), which was superior to those in other reported species, such as Myrica rubra (531 bp) [52], Sorbus pohuashanensis (770 bp) [53] and Platycladus orientalis (534 bp) [54]. In this study, a total of 60,971,652 (coGS) and 57954134 (coYP) reads were cleaned and assembled de novo to produce 56,392 unigenes (Table 1). Based on sequence similarity searching, the unigenes were annotated and classified against the NCBI NR database, GO database, KOG terms, and KEGG pathways. Because of the limited genetic information available for C. officinalis, unknown unigenes were obtained for as many as 5,864 (14.25%) of the sequences (S1 Table). This result has also been found in other species [32]. Possibly, novel genes of C. officinalis could be found in these unknown unigenes, which were related with some unique biosynthesis processes and pathways in our results. Furthermore, the annotated unigenes of C. officinalis indicated the highest homology to those of Vitis vinifera (18.77%), followed by Theobroma cacao (6.17%), and Coffea canephora (5.44%) (S1 Table), which may indicate the evolutionary relationship among these species. In spite of a large number of unigene sequences that indicated no matches, many of unigenes were still assigned to a wide range of KOG and GO classifications. The results showed that our transcriptome data included much genetic information on C. officinalis. These novel unigenes provide an exciting opportunity to study the functional genes in C.officinalis. Similar findings have been reported in Chimonanthus praecox [32], Panax ginseng [36] and Prunus pseudocerasus [39]. The KEGG function annotation analysis showed that 16,182 unigenes were involved in 371 biosynthesis process. The largest number of unigenes (4,451) was associated with metabolic pathways, followed by those associated with biosynthesis of secondary metabolites. However, the smallest number of unigenes was associated with d-arginine and d-ornithine me tabolism, malaria, taste transduction, asthma, penicillin and cephalosporin biosynthesis, aflatoxin biosynthesis, indole alkaloid biosynthesis, and neuroactive ligand-receptor interaction, which have only 1 matching unigene (S6 Table). All of these data contribute to the study of the metabolic and biosynthesis mechanisms in C. officinalis.
In addition, we analyzed the DEGs between coGS and coYP using the edgeR software with the set threshold of FDR or p value < 0.05. The results showed that many genes indicated significant DEGs between coGS and coYP. We found two photosynthesis-related unigenes (c113745_g1, c95953_g1) that were up-regulated significantly at the coYP tissue and downregulated at the coGS (S10 Table). In addition, four unigenes (c50578_g2, c138903_g1, c94113_g1, and c32805_g1) associated with chlorophyll a/b binding protein were significantly up-regulated at the coYP tissue and down-regulated at the coGS (S10 Table). In fact, it was well known that the leaf was the only tissue that participates in photosynthesis and chlorophyll biosynthesis. Thus, it was well-understood that the expression of these unigenes is up-regulated in leaf.
Corni Fructus is a popular TCM in China, which came from the fruit of C. officinalis [46]. Terpeneis is main active ingredients [3]. In the study, we found that the two unigenes c101827_g3 andc78753_g1are related to terpene biosynthesis. The NCBI Blast results also showed that the sequence of c101827_g3 has the greater similarity with arabidopsis thaliana terpene synthase 21 (TPS21) (GenBank number: NM_122301) [55], which is the main enzyme in the biosynthesis of terpeneis in arabidopsis thaliana. As we know, the fruit of C. officinalis is its medicinal parts, and the terpene composition should mainly exist in the fruit of C. officinalis. In our study, the unigenes just were significantly up-regulated in coGS tissue and downregulated in coYP (S10 Table). The results hinted that the terpene is mainly found in the fruit of C. officinalis, which is consistent with earlier reports [3,5].
As everyone knows, MVA (mevalonate) and EMP (Embden-Meyerhof-Parnas pathway) are the basic pathways in the biosynthesis and emission of terpenes, which have been explored in many species [56][57][58][59]. In this study, we found many unigenes involved in terpene biosynthesis based on the unigenes functional annotation. In C. officinalis, terpenoid biosynthesis enzymes involved in the MVA and MEP pathways were distinguished with TPS (trehalosephosphate synthase, c79346_g1, c100180_g2, c151387_g1, c115293_g1, c151414_g1), DXS (1-deoxy-D-xylulose-5-phosphate synthase, c19035_g1, c87460_g2,c93144_g1), and DXR (1-deoxy-D-xylulose 5-phosphate reductoisomerase, c7855_g1, c112578_g1, c147202_ g1) (S3 Fig). All these enzymes play a pivotal role in terpene biosynthesis, which proved the involvement of terpene metabolic pathways in C. officinalis. Furthermore, there were multiple unigene sequences annotated to the same enzyme, in which unique sequences indicated different fragments of a single unigene, different members of a gene family, or both, this finding was similar to a previous report in American ginseng [59].

Conclusions
This was the first comprehensive report on C. officinalis transcriptome. The study presents the transcriptome sequencing results and analysis of C. officinalis leaf and fruit using the Illumina transcriptome sequence platform. A total of 56,392 unigenes with 48,264,743 bp were generated. In addition, we have explored DEGs between the leaf and fruit of C. officinalis using the edgeR database. This study has enriched the genetic data of C. officinalis and the indicated the potential of transcriptome sequencing for functional gene research in species where genomic sequence data are not yet available. We are confident that the results will lay the foundation for further functional genomics and molecular and molecular metabolic mechanism studies on C. officinalis.
Supporting information S1