De Novo Transcriptomes of Forsythia koreana Using a Novel Assembly Method: Insight into Tissue- and Species-Specific Expression of Lignan Biosynthesis-Related Gene

Forsythia spp. are perennial woody plants which are one of the most extensively used medicinal sources of Chinese medicines and functional diets owing to their lignan contents. Lignans have received widespread attention as leading compounds in the development of antitumor drugs and healthy diets for reducing the risks of lifestyle-related diseases. However, the molecular basis of Forsythia has yet to be established. In this study, we have verified de novo deep transcriptome of Forsythia koreana leaf and callus using the Illumina HiSeq 1500 platform. A total of 89 million reads were assembled into 116,824 contigs using Trinity, and 1,576 of the contigs displayed the sequence similarity to the enzymes responsible for plant specialized metabolism including lignan biosynthesis. Notably, gene ontology (GO) analysis indicated the remarkable enrichment of lignan-biosynthetic enzyme genes in the callus transcriptome. Nevertheless, precise annotation and molecular phylogenetic analyses were hindered by partial sequences of open reading frames (ORFs) of the Trinity-based contigs. To obtain more numerous contigs harboring a full-length ORF, we developed a novel overlapping layout consensus-based procedure, virtual primer-based sequence reassembly (VP-seq). VP-seq elucidated 709 full-length ORFs, whereas only 146 full-length ORFs were assembled by Trinity. The comparison of expression profiles of leaf and callus using VP-seq-based full-length ORFs revealed 50-fold upregulation of secoisolariciresinol dehydrogenase (SIRD) in callus. Expression and phylogenetic cluster analyses predicted candidates for matairesinol-glucosylating enzymes. We also performed VP-seq analysis of lignan-biosynthetic enzyme genes in the transcriptome data of other lignan-rich plants, Linum flavum, Linum usitatissimum and Podophyllum hexandrum. The comparative analysis indicated both common gene clusters involved in biosynthesis upstream of matairesinol such as SIRD and plant lineage-specific gene clusters, in particular, genes responsible for biosynthetic pathways for production of podophyllotoxin; CYP71BE54, a key enzyme gene for podophyllotoxin biosynthesis in P. hexandrum, was not found in L. flavum, although both P. hexandrum. and L. flavum yield podophyllotoxin. Altogether, these data have established the fruitful molecular basis of Forsythia and provided insight into the molecular evolution and diversity of lignan biosynthetic pathways.


Introduction
Plants biosynthesize a wide variety of specialized metabolites (formerly termed secondary metabolites), including alkaloids, flavonoids, isoflavonoids, and lignans [1][2][3]. These phytochemicals are produced by plant species-specific enzymatic biosynthesis cascades that are regulated by multiple endogenous factors and exogenous stimuli at genomic and transcriptional levels. Many plant specialized metabolites serve as major bioactive components in functional diets and clinical agents, including Chinese medicines, and as leading compounds in the development of novel synthetic drugs. The considerable recent increases in the aging population require the improvement of a healthy life expectancy via consistent and appropriate intake of low-cost healthy diets and clinical drugs, many of which are derived from plants. Furthermore, little is known about the biological significance of plant specialized metabolites in their host plants. Thus, comprehensive investigation, including transcriptomic analysis, of the regulatory mechanisms of plant specialized metabolite biosynthesis and associated biological processes is of particular importance in light of both medicinal and plant science.
Forsythia spp. (Oleaceae family), commonly known as Golden Bells, are perennial woody plants that include a large number of natural and cultivated varieties. The leaves and fruits are one of the most extensively used medicinal sources of Chinese medicines and functional diets owing to their high lignan contents [1][2][3]. Over the past few decades, the lignan biosynthetic pathways of Forsythia have been identified [1][2][3][4][5]. There is a growing body of evidence for the diverse biological effects of lignans on mammals, including antioxidative, antiobesity, antitumor, antiviral activities, and the reduction of the risk of various cancers and cardiovascular diseases [6][7][8]. Quite recently, we also demonstrated the marked usefulness of triple-transgenic Forsythia callus suspensions as efficient, stable, and sustainable platforms for the production of a beneficial non-Forsythia lignan, sesamin [9].
Collectively, these findings highlighted the potentially high significance of transcriptomic information of lignan-rich plants in more efficient transgenic metabolic engineering of lignan biosynthesis pathways as well as phytochemical studies of lignan biosynthesis. However, such molecular basis of Forsythia remains to be explored. A next-generation sequencing-based transcriptome is expected to provide a vast amount of fundamental information as to genes of enzymes responsible for unidentified lignan biosynthesis steps and/or gene expression profiles associated with lignan biosynthesis. In this paper, we investigated the de novo transcriptomes of Forsythia koreana leaf and callus using a novel contig assembly method, and compared the gene expression profiles of Forsythia with those of other lignan-rich plants: Podophyllum hexandrum, Linum usitatissimum, and L. flavum. from the calli as described previously [9] and were maintained at 22°C in Gamborg's B-5 liquid medium supplemented with 6% sucrose and 0.05 mg l-1 2,4-D (Cell Culture Medium; CCM). All suspensions were cultured on a rotary shaker at 110 rpm in the dark and were subcultured every 2 weeks with an inoculum of 5 ml of saturated suspension cells. Cell suspension cultures that have been maintained for 3 months were used for RNA extraction. Total RNA was extracted from in vitro-grown leaf or callus suspension cultures that had been maintained for 3 months, using an RNeasy Plant Mini Kit (Qiagen, Hilden, Germany) and was subjected to oncolumn DNase I digestion according to the manufacturer's instruction.

RNA-Seq and assembly
The cDNA libraries for RNA-seq were constructed using the TruSeq Stranded mRNA Sample Prep Kit (Illumina, Inc., San Diego, CA, USA), according to the manufacturer's protocol. The quality and concentrations of the libraries were verified using a Qubit fluorometer (Thermo Fisher Scientific, Waltham, MA, USA) and a BioAnalyzer system (Agilent Technologies, Santa Clara, CA, USA), as previously reported [10]. The libraries were then subjected to 101 × 2 cycles of paired-end sequencing, using the HiSeq 1500 platform (Illumina) in High Output Mode.
Total reads were extracted using CASAVA v1. 8.2 (Illumina), and subsequently, PCR duplicates, adaptor sequences, and low-quality reads were removed from the extracted reads as follows. Briefly, if the first 10 bases of the two reads were identical and the entire reads exhibited >90% similarity, the reads were considered PCR duplicates. Base calling from the 5 0 to the 3 0 end was started when the Phred scores of five sequential residues exceeded 30 and stopped when the frequency of accurately called bases dropped to 0.5. The remaining reads were assembled using Trinity (v2.0.6) [11], with normalization and a maximum coverage of 30. For each sample, the reads were aligned using a two-step method, in order to obtain reliable fragment per kilobase of transcript per million mapped reads (FPKM) values. Putative gene names for the contigs were estimated based on a homology search of the UniProt database, using blastn 2.29+ [12]. To determine expression levels, the total numbers of mapped reads were estimated using Bowtie (v2.1.0) [13], and the short reads were mapped against the contigs with Bowtie, allowing single mutations and ignoring the mismatch penalty for nucleotides with low-quality values (<20).

Gene ontology (GO) analysis and identification of plant specialized metabolism -related contigs
To predict the function of leaf-and callus-specific transcripts, the specific sequences were analyzed using GO enrichment analysis, as described previously [10]. Briefly, specific or constitutively expressed sequences were analyzed using Blast2GO for functional annotation [14], and the annotated biological information was then compared using GO 'is_a' graphs. The graph was drawn with Cytoscape (http://www.cytoscape.org/). To quantify the enrichment of GO terms, we calculated enrichment scores as follows: where N x (GO) indicates the frequency of each GO term for leaf (leaf<<callus)-or callus (cal-lus<<leaf)-specificgenes (i.e., x), and N total (X) indicates the frequency of leaf (leaf<<callus)-or callus (callus<<leaf)-specific genes (i.e., X) mapped to each GO term in the Blast2GO results. For correction, a pseudo-count was set as fixed value (0.05).
Contigs with sequences similar to enzymes involved in special metabolites biosynthesis were identified using blastx against a local reference database of Swiss-Prot data with GO terms for secondary metabolic processes (GO:0019748), with an e-value threshold of 0.001. Fulllength ORFs were detected using the algorism of TargetIdentifier [15].
Generation of full-length ORF sequences of contigs using Virtual Primer-Based method Trinity employs the de Bruijn graph algorism by splitting reads into much shorter k-mers, in order to efficiently handle the massive number of reads within a practical length of time. However, the overlap-layout-consensus assembly algorism [16,17] is more suitable for regions of low-depth coverage. Therefore, to overcome the difficulties caused by the low-depth regions of the Trinity-based contigs and to generate full-length ORF sequences, we developed a memory and time-efficient overlap-layout-consensus reassembling method, virtual primer-based sequence assembly (VP-seq; Fig 1), which comprises the following steps: (1) reads that were not mapped to Trinity-based contigs and reads mapped to transcripts were collected and stored in a hash table without pairing information; (2) the search key (virtual primer) sequences with a length of 20 bp were set for the 5 0 and 3 0 ends of the transcripts; (3) reads containing the primer sequences were mapped to the incomplete ORF sequences; and (4) the elongated nucleotide sequences were determined from the primers (5 0 primer to 5 0 end and 3 0 primer to 3 0 end), using the following equation: where the posterior probabilities of nucleotides were calculated and the elongating nucleotides including ambiguous nucleotides (R, Y, K, M, S, W) were selected to maximize the posterior probabilities, and to avoid low complexity or sequence contamination, the occurrence of successive ambiguous nucleotides was limited. Furthermore, where N(position = i) represents the number of reads mapped to position i and N(nt | position = i) is the number of reads with contained nucleotide nt at position i. For correction, a pseudo-count n was set to 20, and finally, (5) the process was repeated from step (2) and until no nucleotides were added to either end. Full-length ORFs were detected using the algorism of TargetIdentifier [15]. Unlike the OLC algorism, VP-seq utilizes only both ends of contigs as virtual primer regions to reduce computational cost. To collect adequate reads with virtual primer-based search for posterior probability calculation and contig elongation, two features were implemented in VPseq. Firstly, a single mismatch was allowed for searching reads containing virtual primer sequences. In the first-round VP-seq, which utilized virtual primer regions corresponding to both ends of original contigs obtained using Trinity, a single mismatch was allowed between the first virtual primers and reads containing the virtual primer sequences. This mismatch allowance enabled the complimentary recognition of read sequences containing a single base call error or a single nucleotide variation by virtual primers, followed by the posterior probability estimation and elongation. Secondly, the first-round VP-seq-based contigs containing ambiguity nucleotide codes were used as templates for design of the second-or later-round virtual primers for further contig elongation. Consequently, contigs were further elongated using read sequences derived from both heterogenous alleles regardless of nucleotide variations at virtual primer regions. To evaluate the efficiency of VP-seq, the computation time for VP-seq including the process of Trinity and TGICL was monitored.

Quantitative reverse transcription-PCR (qRT-PCR)
To verify the results obtained by RNA-Seq, the expression of three lignan biosynthetic genes (FkSIRD, FkMOMT, and FkUGT74S1) was analyzed by qRT-PCR as previously described [18]

Phylogenetic analysis for lignan-biosynthetic enzymes
The RNA-seq data for VP-seq and FPKM calculation of P. hexandrum, L. usitatissimum, and L. flavum were obtained from the MedPlant RNA Seq Database (http://www.medplantrnaseq. org/; assembled contigs version 20101112). After eliminating sequences without Pfam domains, the full-length sequences obtained via VP-seq were translated and subjected to Pfam search [18]. The hit sequences for the respective Pfam domains were then translated to amino acid sequences, aligned using ClustalW2, and phylogenetic trees were reconstructed using unweighted pair group method with arithmetic mean (UPGMA) in ClustalW2.

RNA-Seq and de novo assembly
The RNA-Seq for the F. koreana callus and leaf using HiSeq 1500 yielded 41 and 47 million reads for 101 paired-end read, respectively ( Table 1). The resulting fastq files were deposited on SRR2075825 (callus) and SRR2075824 (leaf). After base calling for clean reads, we obtained 39 and 45 million reads, respectively, with the average length of 97.4 bp paired-end reads ( Table 1). Trinity (v2.0.6) analysis of the paired-end reads generated 383,714 contigs. Furthermore, reassembly of these contigs using TGICL (v2.1) led to the elucidation of 116,825 unigenes with an N50 of 977 from both the callus and leaf ( Table 1). The unigenes covered 86.16% and 85.31% of reads from the callus and leaf libraries, respectively, and 79.46% of the unigenes

GO term enrichment analysis
To predict the biological functions of the 6,843 genes differentially expressed in the leaf and callus (3,963 genes in the leaf, and 2,880 genes in the callus), the resulting contig sequences were annotated by BLAST2GO [14]. The enrichment score was calculated as described in our previous report [10]. The annotated biological information between leaf and callus was compared by heat map in graph view for biological process (Fig 2). Nodes represent ontology terms and edges are 'is_a' relations used in the GO Biological Process ontology. As depicted in Fig 2, photosynthesis-related (photosynthesis, light harvesting: GO:0009765; electron transport chain: GO:0022900; chloroplast organization: GO:0009658; response to light stimulus: GO:0009416); inorganic ion, amino acid, and peptide transport-related (GO:0006811); protein metabolic process-related (GO:0019538); and lipid metabolism-related (GO:0008610 and GO:0006720) genes were upregulated in the F. koreana leaf, whereas metabolic processrelated (secondary metabolic process: GO:0019748; macromolecule catabolic process: GO:0009057), body structure organization-related (external encapsulating structure organization: GO:0045229), anatomical structure development-related (GO:0048856 and GO:0003006), and salt stress response-related (GO:0009651) genes were upregulated in F. koreana callus.
More than 50% of upregulated transporter-related (GO:0006811; 27/48) and protein metabolic process-related genes (GO:0019538; 67/121) were chloroplast proteins responsible for photosynthesis and electron transport. For example, the photosystem I complex subunits psaK (TR52663_c0_g1_i1), psiOs (TR52663_c0_g1_i1 and TR48474_c0_g2_i1), and petC (CL19679Contig1), which constitute an integral chloroplast membrane protein complex for photosynthetic light reactions, were all upregulated in leaf and categorized as both photosynthesis-and transport-related processes. Moreover, the upregulated protein metabolic process genes reflected the rapid degradation and reassembly cycle of the photosystem complex. In addition, the chlorophyll-binding proteins LHCA (CL1468Contig3, CL3961Contig1, CL12344Contig1, and CL15018Contig1) and LHCB (TR48065_c1_g3_i4, CL427Contig2, CL8878Contig1, and TR45565_c1_g2_i1), as well as FtsH1 (CL11358Contig1), which is a component of the FtsH protease, were all upregulated in leaf and were categorized as both photosynthesis and protein metabolic processes. In addition, the genes for lipid metabolic process were also upregulated in the leaf. These GO terms were typical biological processes in leaves, where photosynthetic-related metabolic processes are exerted under light condition.
In contrast, the anatomical structure development-related (GO:0048856) and salt stress response-related (GO:0009651) genes were upregulated in callus. Such gene upregulation was also detected in Arabidopsis thaliana calli [20]. Intriguingly, the most enriched GO term in the callus was found to be implicated with secondary metabolic process (GO:0019748). Therefore, we investigated the functional characteristics of Forsythia genes involved in specialized metabolite synthesis.
Collection and annotation of plant specialized metabolism -related enzymes using "Virtual Primer"-based sequence assembly Considering that Forsythia is a well-known medicinal plant that produces a wide variety of phenylpropanoid-derived specialized metabolites including lignans, we attempted to annotate the sequences of these metabolite-related genes. Initially, we subjected the contigs to blastx analysis against known UniProt enzyme sequences with each Pfam motif (Table 2), and obtained 1,576 contigs with e-values of <0.001. However, 90.8% (1,430) of the contigs harbored partial ORFs and no Pfam domain, which were insufficient for precise functional annotation and molecular phylogenetic analyses. Such incomplete annotation of partial ORFs obtained using Trinity was frequently found in other Trinity-based contigs of de novo transcriptomes [21,22], indicating that most of Trinity contigs can be applied for GO enrichment, but not for precise annotation and molecular phylogenetic analyses. Such short contigs are attributed to the Trinity's k-mer-based fractionation methods, given that de Bruijn graphbased algorisms, including Trinity, fractionate the reads to k-mers and select k-mers (>1.5 information contents) to reduce the calculation cost. This fractionation can be optimized only for highly expressed regions, resulting in the formation of short contigs [23]. Such strategy is an efficient computation approach for assembling vast amounts of short reads and is adopted by most de novo RNA assemblers, such as SOAPdenovo-Trans [24], Velvet/Oases [25] and Trans-ABySS [26]. However, the k-mer size cannot be optimized for whole transcript sequences due to the heterogeneous depth caused by different expression levels of respective transcripts and the exclusion of read number information by k-mer selection.
To increase contigs containing full-length ORFs, we developed a post-assembly method, named Virtual Primer-based Sequence assembly (VP-seq) that elongates 5'-and 3'-ends of generated contigs. VP-seq involves reassembly of Trinity-based contigs using an overlap-layout-consensus (OLC) algorism [16,17] that does not fractionate reads to k-mers. Eventually, Trinity-based contigs are elongated in silico which is reminiscent of rapid amplification of cDNA ends (RACE). The OLC algorism involves all-against-all and pairwise read alignments. Although many OLC algorism-based assemblers including Newbler and Celera assembler apply for K-mer based seed & extend algorism to find overlapped sequences to reduce the calculation cost, pairwise alignment step still has O(n 2 ) computational cost, where O(x) represents the cost for calculation directly proportional to x and n is the number of input reads. Indeed, Newbler failed to complete the pairwise alignment step for 89 million Forsythia paired end RNA seq reads in one month. On the other hand, the computational order for VP-seq is O(n Ã m), where n is the number of input reads and m is the number of contigs of interest from Trinity outputs, which is much smaller than n, given that VP-seq involves all-against-target and pairwise alignments for virtual primer sequence-containing reads. Indeed, VP-seq, including Trinity and TGICL (32 hours), estimated the full length output for 1,576 specialized metabolism-related contigs in 62 hours using a 80-core 2TB server for our following validation. These results indicated the marked advantages in efficient calculation cost and precise elongation of contigs over the OLC algorism.
To validate the VP-seq method, we attempted to reconstitute three known sequences (CL1986Contig2 for FkUGT7, CL19667Contig1 for DXR and CL23194Contig2 for psd_Fi2), the full-length sequences of which failed to be assembled by Trinity. As shown in Fig 3 and S1-S3 Figs, the newly elongated sequences coincided with those deposited in the DDBJ database except for minor variants, probably due to the variations or polymorphisms within the species. In general, Trinity is considered to extend a contig, when various reads corresponding to the terminal regions of a contig are abundantly available. However, Trinity failed to extend contigs in the present study despite a high number of reads corresponding to the terminal regions of the contigs (Fig 3, arrows), suggesting that such incomplete assembly by Trinity resulted in the elucidation of only partial ORF sequences. Altogether, these results provide evidence that our VP-seq assembly method is the efficient and reliable method for assembling full-length ORFs from short Trinity-based contigs (input data 1) assembled by reads (input data 2) from the present de novo RNA-Seq.
Subsequently, VP-seq of the aforementioned 1,576 contigs revealed that 480 contigs harbor redundant sequences. 709 of the independent 1,090 contigs were found to harbor full-length sequences (Table 3). A striking feature is that 563 new full-length ORF sequences were elucidated by VP-seq (Table 3). These data indicate that VP-seq improved the percentage of contigs containing a full-length ORF to 72%. Moreover, VP-seq was shown to elucidate the full-length ORF sequences of the contigs with a mean length of 1,146 base pairs via extension of 253 base pairs in average. As shown in Fig 4A, 68% of the VP-seq-based contigs were >1200 bp in length, whereas only 32% of the Trinity-based contigs were >1200 bp. Considering that the mean length of the reference UniProt enzyme sequences was 400 amino acids, we concluded that the length of the VP-seq-based contigs was sufficient for further annotation and molecular phylogenetic analyses, In addition, we examined the correlation between the depth (Fragment Per Kilobase Per Megareads, FPKM) and the percentage of full-length ORF contigs to total contigs elucidated by VP-seq ( Fig 4B). Full-length ORF sequences for contigs with low expression levels (FPKM < 0.01) were not recovered, whereas those of 77% highly expressed contigs (FPKM >1) were elucidated (Fig 4B).
We also compared the ORF sequences of the Trinity-based contigs and VP-seq-based contigs to determine factors causing sequence redundancy. The sequence alignment of these ORFs revealed that the redundancy was derived from 5' and 3'fragments of the same transcripts ( Fig 5A). Although such asymmetric alignments were observed in 10.6% of Trinity-based contig pairs, VP-seq reassembly reduced the percentage to 1.5% (Fig 5B). Since the asymmetric alignments among partial ORFs bias the estimation of copy numbers and phylogenetic distances of transcripts, full-length ORF sequences of contigs are requisite for the following quantification of expression levels and molecular phylogenetic analysis. In combination, these results confirmed the marked usefulness of the VP-Seq method for the elucidation of fulllength ORF sequences of contigs.

Structural classification of multiple metabolic enzyme genes
The large number of full-length ORF sequences obtained by VP-seq enabled a rigorous analysis of the distribution of contigs among Pfam-based structural classifications. The Pfam search identified 203 p450s, 82 UDP-glucose-dependent glucosyltransferases (UGTs), 62 non-heme dioxygenases, 54 transferases, 141 alcohol dehydrogenases, 9 Nmr A family members, 26 Table 3. Summary of the assembly of metabolism-related contigs from Forsythia koreana by Trinity or virtual primer-based sequence assembly (VP-seq). methyltransferases, and 13 chalcone and stilbene synthases. Of those 590 sequences, 34% (223) and 23% (137) were upregulated by more than two-fold in callus and leaf, respectively, which was much higher than the percentage of differentially expressed sequences across the whole transcriptome (3% for callus and 4% for leaf). The GO:0009809 (lignin metabolism) includes lignan-biosynthetic genes. Thus, we further analyzed the VP-seq-based contigs against GO:000980, and verified the enrichment of 15 sequences in the callus library. The lignan pathway [26] (or see Fig 5) starts with deamination of phenylalanine by PAL1 to cinnamic acid, which is catalyzed to p-coumaric acid by CYP73A5. Then, p-coumaroyl-CoA is generated from p-coumaric acid by 4-coumarate-CoA ligases (4CLs: 4CL1, 4CL3, 4CLL6 and 4CLL7) and is converted to feruloyl-CoA via hydroxylation by FAH and HST, followed by methylation by F4IAT4. Feruloyl-CoA is catalyzed to coniferyl alcohol by cinnamoyl-CoA reductase I (CCR1) and cinnamyl alcohol dehydrogenases (CADs: CAD1, CAD5, CAD9 and ATCAD4).

# contigs Total length (bp) Mean length (bp) Full-length ORFs
Comparison of FPKMs between callus and leaf revealed that expression of the enzymes catalyzing the first two steps, PAL1 and CYP73A5, were 16-fold and 12-fold upregulated in callus, respectively. The expression pattern of 4CLs varied among subtypes; 4CL1 and 4CL3 were 3-fold and 30-fold upregulated in callus, respectively, while 4CLL6 was 100-fold upregulated in leaf (Fig 6A and 6B). Since the FPKM value for 4CL3 in callus (144.3) was much larger than the value for 4CLL6 in leaf (28.7), 4-coumarate-CoA ligase activity is highly likely to be more potent in callus than leaf. The enzymes that convert p-coumaroyl-CoA to feruloyl-CoA (FAH, HST and F4IAT4) were also upregulated, although FAH and F4IAT4 were expressed at low levels (FPKM < 4) in both leaf and callus. The downstream reductases (CAD1, 5, 9 and ATCAD4) and alcohol dehydrogenase were 2.6-to 2.8-fold upregulated in callus. Altogether, these analyses suggested that coniferyl alcohol, which is the direct precursor compound of a basal lignan, pinoresinol, was more abundantly synthesized in callus than in leaf.
The later lignan biosynthesis pathway originate from the coupling of stereo-specific coniferyl alcohol in the presence of diligent protein (DIR) [27], leading to the generation of pinoresinol. Pinoresinol is stepwisely reduced to lariciresinol and then secoisolariciresinol by pinoresinol-lariciresinol reductase (PLR). Secoisolariciresinol is converted into matairesinol by secoisolariciresinol dehydrogenase (SIRD). Furthermore, matairesinol is methoxylated to arctigenin by matairesinol O-methyltransferase (MOMT) [28]. These lignans are known to be Although no over-lapping region (distance = 1) was found between the Trinity-based contigs CL30807Contig1 (red) and CL177591Contig1 (blue), the two contigs were identical (distance = 0) after elongation by virtual primer-based sequence assembly (VP-seq). (B) Relationship between the distances of Trinity-based and VP-seq-based contigs.
doi:10.1371/journal.pone.0164805.g005 glucosylated by UDP-glucose-dependent glucosyltransferases (UGTs). UGT71A18 [29], and UGT74S1 glucosylate pinoresinol and secoisolariciresinol [30] respectively, whereas no matairesinol-or arctigenin-glucosylating enzymes have been identified. All of the enzymes responsible for the biosynthesis of lignan aglycones were upregulated in callus. The expression levels of DIR, PLR, SIRD, and MOMT were 40-fold, 5-fold, 50-fold, and 2-fold higher in callus than in leaf, respectively (Fig 6A), proving that matairesinol biosynthesis is prominently enhanced and that matairesinol is highly accumulated in callus. Such intense expression of SIRD in callus is compatible with our previous findings that matairesinol, but not arctigenin, is specifically accumulated in Forsythia callus suspension culture [31], highlighting a novel pivotal role of matairesinol in the growth, proliferation, or differentiation of Forsythia callus. To confirm the results obtained by VP-Seq-based RNA-seq, the expression of three genes responsible for biosynthesis downstream of lariciresinol, FkSIRD, FkMOMT and FkUGT74S1 (Fig 6A), was examined by qRT-PCR, because these genes showed distinct callus-selective (FkSIRD and FkMOMT) or leaf-selective (FkUGT74S1) expression ( Fig 6B). As shown in Fig 6C, FkSIRD and FkMOMT was found to be expressed predominantly in callus, whereas expression of FkUGT74S1 was much more intense in leaf than in callus. These results were in good agreement with the VPseq-based RNA-seq data (Fig 6B), confirming the quality and reliability of VP-seq.
Since lignans often accumulate as water-soluble glucosides in lignan producing plants, we also analyzed the expression levels of UGTs or their candidates, which were deduced from VPseq-based contigs (Fig 7). CL128Contig2 showed the most remarkable (52-fold) upregulation in Forsythia callus, although no known homologous UGTs were found by sequence comparison. To annotate the upregulated UGT genes in Forsythia callus, the molecular phylogenetic trees for UGT-like sequences of the full-length contigs were constructed using clustal W2 (Fig  7). It is noteworthy that the similarity of chemical structures of substrates for UGTs is implicated with the phylogenetic proximity of UGTs (Fig 7). Consistent with similar chemical structures of secoisolariciresinol and matairesinol (Fig 5), these data suggest that secoisolariciresinol glucosyltransferases (UGT74S1) is positioned proximally to a matairesinol glucosyltransferase that has not ever been identified. Notably, CL10456C1, CL14684Contig1 and CL15275Contig1 are also located at the proximal position to UGT74S1 (Fig 6B), although their substrates have yet to be identified. Moreover, CL10456Contig1, CL14684Contig1 and CL15275Contig1 exhibited 50% sequence identities to UGT74S1 (S1 Table), and the three glucosyltransferases were upregulated in callus by 2.9-, 1.7-, and 3.3-fold, respectively (Fig 6A and 6B). In keeping with the highest expression of SIRD (matairesinol biosynthetic enzyme) in callus as stated above, a matairesinol glucosyltransferase is highly likely to be upregulated in callus as well, given that a large portion of matairesinol was shown to be glucosylated in Forsythia callus suspension [31]. In combination, the molecular phylogenetic analyses of VP-seq-based full-length ORF sequences suggest that either some or all of the three Forsythia UGTs, CL10456Contig1, CL14684Contig1 and CL15275Contig1, are responsible for the glucosylation of matairesinol.

VP-seq of other lignan-producing plant transcriptomes
Other lignan-rich plants biosynsethize structurally diverged lignans with beneficial biological effects on human, including anti-tumor activity of podophyllotoxin, which are not produced in Forsythia [4,5], and RNA-seq data have been reported for several typical lignan-producing plants, including L. flavum, L. usitatissimum, and P. hexandrum [27,[32][33][34] (Table 4). The Illumina 54-bp paired end read data for other plants (Linum flavum, Linum usitatissimum and Podophyllum hexandrum) were downloaded from the MedPlantRNASeq DB [http://www. medplantrnaseq.org/] ( Table 4). Since the RNA-seq analyses for these plants provide shorter reads than those of the present study (probably due to 54bp paired end reads), we used VP-seq to reassemble the data into longer contigs (Table 5).
A total of 1,909, 1,173 and 2,146 plant specialized metabolism -related contigs were collected for L. flavum, L. usitatissimum, and P. hexandrum, respectively. 80-90% of these contigs were shown to harbor incomplete ORFs, as detected in Forsythia (Table 5). Nevertheless, subsequent VP-Seq re-assembly for the contigs led to the elucidation of 427, 174 and 284 additional full-length ORFs for L. flavum, L. usitatissimum and P. hexandrum, respectively (Table 5). In addition, as in Forsythia, 45-55% contigs were found to be redundant sequences. Finally, we detected 812, 401, and 570 full-length ORFs in L. flavum, L. usitatissimum and P. hexandrum, respectively (Table 5). These also data confirmed the universal ability of VP-seq to reassemble reads as short as 50 bp. To investigate the phylogenetic distribution of specialized metabolite enzyme genes, we performed comparative analysis among four lignan-rich plant species: F. koreana, L. flavum, L. usitatissimum, and P. hexandrum. The sequences harboring each of the 14 Pfam motifs were aligned with amino acid sequences in a Gonnet series, and phylogenetic trees were reconstructed using UPGMA with 1000 bootstrap replicates (S4-S17 Figs). Subsequently, we examined the relationship between the molecular phylogenetic distance of orthologs/ paralogs and the species-specificity of metabolic pathways, given that the four plant species possess unique lignan biosynthetic pathways. Analysis of the 14 resultant phylogenetic trees based on R function, cuttree, resulted in the generation of 353 gene clusters (Table 6). Of those, 32 clusters were shared by all of the four plant species. 58 and 71 clusters were shared with three and two species, respectively. Among the remaining 192 clusters, 52 contained two or more genes from a single species. Subsequently, we examined the molecular phylogenetic distance for orthologues/paralogs and the species-specificity of lignan biosynthesis-related genes, given that the four plant species possess unique lignan biosynthetic pathways downstream of secoisolariciresinol (Fig 8A). PLRs were shown to be present in all of the four plants (Fig 8B), which is compatible with the findings that pinoresinol, lariciresinol, and secoisolariciresinol are detected in these plants [3,4]. Matairesinol, which is biosynthesized from secoisolariciresinol, has been detected in F. koreana, L. flavum, and P. hexandrum, and has been shown to undergo two metabolic pathways. First, Forsythia MOMT methylate matairesinol, leading to arctigenin [28], and second, a P450 family enzyme in P. hexandrum, CYP719A23, converts matairesinol to pluviatolide [34]. A recent study also demonstrated that pluviatolide was converted to yatein by three P. hexandrum enzymes: PhOMT3, CYP71CU1 and PhOMT1 [35] (Fig 8A). Although the key enzyme has yet to be fully identified, yatein is predicted to be the intermediate of podophyllotoxin, and Lau et. al. reported that three P. hexandrum enzymes participate in the  ADH_N  3  0  3  4  1  0  0  0  1  0  2  2  0  0  0  0  2  7  3  28   ADH_zinc_N  3  1  2  3  0  0  0  1  0  1  1  2  0  0  0  0  1  4  3  22   Chal_sti_synt_N  2  1  0  3  1  1  0  0  1  1  1  2  0  1  0  1  0  1  1  17   DIOX_N  5  3  4  3  1 Total  47  24  32  37  15  19  1  17  11  10  13  22  11  4  14  11  9  24  32   Forsythia Transcriptome biosynthesis of podophyllotoxin ( Fig 8A). Additionally, thujaplicatin has been reported as an alternative intermediate in Anthriscus sylvestris [28]. Although intermediates and the enzymes involved in the conversion from matairesinol to podophyllotoxin have not yet been identified, podophyllotoxin has also been detected in L. flavum [3,4]. During our phylogenetic analysis, all known PLRs (FkPLR, LuPLR1, LuPLR2 and PhPLR) were clustered in NmrA-2 ( Fig 8B). Although no PLR have yet been characterized in L. flavum, the NmrA-2 cluster also contained two L. flavum genes (medp_linfl_20101112_1685 and med-p_linfl_20101112_20619), which are homologous to two LuPLRs (LuPLR1 and LuPLR2). Furthermore, medp_linfl_20101112_1685 and medp_linfl_20101112_20619were proximal to LuPLR1 and LuPLR2, respectively. In addition, the phylogenetic cluster indicated that Linum PLR1s and PLR2s were the most similar to PhPLR and FkPLR, respectively (Fig 8B). LuPLR1, medp_linfl_20101112_1685 (LfPLR1), and PhPLR were expressed with FPKMs of 54.8, 20.6, and 25.2, respectively, whereas FkPLR, LuPLR2, and medp_linfl_20101112_20619(LfPLR2) were more highly expressed with FPKMs of 176.9, 220.7, and 124.0, respectively. The adh_short-1 cluster contained total 5 sequences, including FkSIRD, from the four species. Only L. flavum was shown to possess two paralogs (medp_linfl_20101112_25640 and medp_linfl_20101112_43351) in the cluster, both of which were significantly expressed (FPKM = 21.2 and 77.2, respectively). Together, these results suggest that the two SIRD-like genes are L. flavum-specific subtypes. FkSIRD, PhSIRD, and medp_linus_20101112_26787 were expressed with FPKMs of 62.5, 11.8, and 62.5, respectively, confirming their biological significance (Fig 8B).
Pinoresinol and secoisolariciresinol are glucosylated by UGT71A18 and UGT74S1, respectively [29,30]. UGT71A18 and UGT74S1 were included in the cluster UDPGT-13 and UDPGT-1, respectively. Both of the clusters are constituted by the genes from all four species. Unlike the UDPGT-1 cluster, the UDPGT-13 cluster was further separated into plant speciesspecific sub-clusters. For example, FkUGT74S1 formed a sub-cluster with two F. koreana UGTs (CL1411Contig2 and CL14315Contig2), and sub-cluster formation was also observed among the L. flavum UGTs (medp_linfl_201112_2514, medp_linfl_201112_77217, med-p_linfl_201112_66634, and medp_linfl_20101112_28509). These molecular phylogenetic trees suggest the plant lineage-specific multiplication and diversification of UGT71A subfamilyrelated genes. FkMOMT, which converts matairesinol to arctigenin, was included in the F. koreana-specific cluster (cluster methyltransf_2-8), which coincides with the specific detection of arctigenin in F. koreana among the four plant species [3]. In contrast, the genes involved in the metabolic pathway from matairesinol to podophyllotoxin were restricted to podophyllotoxinproducing plant (L. flavum and P. hexandrum)-specific clusters, which is also in good agreement with the exclusive detection of podophyllotoxin in L. flavum and P. hexandrum.
CYP719A23, which converts matairesinol to pluviatolide [34], was located in the P. hexandrum-specific cluster, p450-29, and was expressed with the highest FPKM of 331.4 in the cluster (Fig 8B). In contrast, no CYP719A23-related L. flavum-specific clusters with FPKM values of > 10 were found. The lack of a CYP719A23-like contig in L. flavum suggests that the alternative podophyllotoxin metabolic pathway originated from convergent evolution. PhOMT3, which oxidizes pluviatolide to5' desmethoxyyatein, was also located in the P. hexandrum-specific cluster, methyltransf_2-2, and was expressed with an FPKM of 10.4. Two other genes in the methyltransf_2-2 cluster, medp_podhe_20101112_9385 and medp_podhe_20101112_6084, exhibited high FPKM values of 20.6 and 19.0, respectively. Although a pluviatolide synthase (a CYP719A23-like gene) was not detected in the L. flavum transcriptome in the present study, the presence of the gene proximal to PhOMT3 suggested the existence of a similar catalytic pathway in L. flavum. In contrast to the two aforementioned enzymes, CYP71CU1, which oxidizes 5' desmethoxyyatein to 5' desmethylyatein [35], formed a podophyllotoxin-producing plants-specific cluster, p450-42 that contained the proximal L. flavum gene, medp_linfl_20101112_35191.
CYP71CU1 and medp_linfl_20101112_35191 exhibited the highest FPKMs in the cluster, with values of 20.8 and 18.2, respectively. Although little is known about the biosynthesis of podophyllotoxin in L. flavum, the structural similarity of medp_linfl_20101112_35191 to CYP71CU1 and high expression in lignan-producing plants suggests the involvement of this F. flavum P450 in lignan biosynthesis. Moreover, other enzymes for podophyllotoxin biosynthesis, PhOMT1 and Ph2-ODD, also formed P. hexandrum-specific clusters, Methyltransf_2-2 and DIOX_N-9, respectively. In each cluster, PhOMT1 and Ph2-ODD also exhibited the highest FPKMs with values of 16.3 and 272.4, respectively. Collectively, the phylogenetic clustering analyses support the notion that podophyllotoxin biosynthesis in the two species had evolved independently. Lau et al. also reported that two different P450 enzymes, CYP71BE54 and CYP82D61 are responsible for converting deoxypodophyllotoxin to 4'-desmethyl-deoxypodophyllotoxin and 4'-desmethyldeoxypodophyllotoxin to 4'-desmthyl-epipodophyllotoxin, respectively [35]. CYP82D61 was clustered with 13 uncharacterized L. flavum P450s (cluster p450-41), medp_linfl_20101112_48131 exhibited an outstanding FPKM value of 786.9 (ten-fold higher than that of CYP82D61). In contrast, no apparent CYP71BE54-related gene was found in transcriptomes from L. flavum, also underpinning the convergent evolution of podophyllotoxin biosynthetic enzymes.

Conclusion
We have successfully developed a novel bioinformatics application, VP-seq, which enables effective elucidation of full-length ORF sequences via extension of incomplete contigs generated by Trinity de novo assembly of F. koreana leaf and callus. Furthermore, we demonstrated comparative VP-seq-based transcriptome analyses of major lignan-producing plants leading to the detection of complete ORF sequences of candidate genes for unidentified lignan biosynthesis pathways. Thus, VP-seq accelerated the characterization of enzymes that catalyze undetermined steps in lignan biosynthesis and is also widely applicable to the elucidation of full-length ORF sequences of any de novo assembled incomplete contigs in transcriptomes.
The molecular phylogenetic analysis of specialized metabolic enzyme genes from lignan-producing plants revealed both common gene clusters that include genes from various plants and plant lineage-specific gene clusters. The former include enzymes involved in the early common lignan biosynthesis upstream of matairesinol such as PLR and SIRD, suggesting that they have occurred in their ancestral plants and conserved their biological functions. In contrast, the latter, such as CYP82D61, are likely to participate in lignan biosynthesis downstream of matairesinol. The unique gene sets for podophyllotoxin biosynthesis in P. hexandrum and L. flavum, which were originally verified by rigorous analyses of VP-seq-based full-length ORFs, also provide insight into the molecular evolution and diversity of species-specific enzymes involved in biosynthesis of various structurally related lignans at later steps, and the molecular basis for novel transgenic metabolic engineering of these lignan-producing plants.