De novo transcriptome sequence of Senna tora provides insights into anthraquinone biosynthesis

Senna tora is an annual herb with rich source of anthraquinones that have tremendous pharmacological properties. However, there is little mention of genetic information for this species, especially regarding the biosynthetic pathways of anthraquinones. To understand the key genes and regulatory mechanism of anthraquinone biosynthesis pathways, we performed spatial and temporal transcriptome sequencing of S. tora using short RNA sequencing (RNA-Seq) and long-read isoform sequencing (Iso-Seq) technologies, and generated two unigene sets composed of 118,635 and 39,364, respectively. A comprehensive functional annotation and classification with multiple public databases identified array of genes involved in major secondary metabolite biosynthesis pathways and important transcription factor (TF) families (MYB, MYB-related, AP2/ERF, C2C2-YABBY, and bHLH). Differential expression analysis indicated that the expression level of genes involved in anthraquinone biosynthetic pathway regulates differently depending on the degree of tissues and seeds development. Furthermore, we identified that the amount of anthraquinone compounds were greater in late seeds than early ones. In conclusion, these results provide a rich resource for understanding the anthraquinone metabolism in S. tora.


Introduction
Senna tora (Subfamily, Caesalpiniaceae; and Family, Leguminosae) also known as Cassia tora, is an annual shrub which grows in the arid zones after the rainy season [1]. This plant is mostly found in India, China, Sri Lanka, Nepal, the Korean peninsula, and other Asian countries. Its name varies in different locales such as Foetid Senna tora, Sickle senna, Wild senna, Coffee pod, Tovara, Chakvad, and Ringworm plant. S. tora leaves, seeds, and roots have long been used as food ingredients. It is also valued as a medicinal plant in Ayurveda, commonly used as

Plant material and RNA preparation
Specimens of S. tora (cv. Myeongyun) were grown in an experimental plot of National Institute of Horticultural and Herbal Science (Eumseong) field. The distance between two adjacent plants was maintained at 50×40 cm and fertilizer (N-P20-K20) was applied at 8-10 Kg per 10a. Leaf, root, and early-and late-stage seed tissues were harvested from healthy plants, and stored at -80˚C until used for RNA extraction. Total RNA was extracted from leafs, roots, and two stages of seeds of S. tora using the RNeasy Plant Mini kit (Qiagen, InS., Valencia, CA, USA). RNA purity was determined using NanoDrop8000 Spectrophotometer and Agilent Technologies 2100 Bioanalyzer, and total RNA integrity was identified as having a minimum integrity value of 7.

Illumina short-read sequencing
The poly (A) + mRNA was purified and fragmented from 1 μg of total RNA using poly-T oligoattached magnetic beads by two rounds of purification. Using reverse transcriptase, random hexamer primers, and dUTP, the randomly-cleaved RNA fragments were transcribed reversely into first-strand cDNA. A single A-base was added to these cDNA fragments followed by adapter ligation. The products were purified and concentrated by PCR in order to generate a final-strand specific cDNA library. The quality of the amplified libraries was verified using capillary electrophoresis (Bioanalyzer, Agilent). Quantitative PCR (qPCR) was carried out using SYBR Green PCR Master Mix (Applied Biosystems). Then we pooled together equimolar amounts of libraries that were index-tagged. The cBot-automated cluster creation system (Illumina) performed cluster generation in the flow cell. The sequencing was performed with 2 x 100 bp read length of the flow cell loaded on a HiSeq 2500 sequencing system (Illumina).

Long-read sequencing
Libraries for Pacific Biosciences Single Molecule Real Time (SMRT) sequencing were prepared from the pooled samples of aforementioned cDNAs. Cycle optimization was performed to determine the optimal number of cycles for large-scale PCR. We prepared 3 fraction cDNAs (1-2 kb, 2-3 kb, and 3-6 kb) using the BluePippin Size selection system. The SMRTbell library was constructed by using SMRTbell TM Template Prep Kit (PN 100-259-100). The DNA/Polymerase Binding Kit P6 (PacBio) was used for DNA synthesis after the sequencing primer annealed to the SMRTbell template. Following the polymerase binding reaction, the MagBead Kit was used to bind the library complex with MagBeads before sequencing. MagBead-bound cDNA complexes result in increased number of reads per SMRT cell. This polymerase-SMRTbell-adaptor complex was then loaded into zero-mode waveguides (ZMWs). The SMRTbell library was sequenced using 8 SMRT cells (Pacific Biosciences) with C4 chemistry (DNA sequencing Reagent 4.0). 1 × 240 minute movies were captured for each SMRT cell using the PacBio RS II sequencing platform. Long reads were identified by SMRT Analysis v2.2 RS_IsoSeq.1 classify protocol. Then it was clustered all the full-length reads derived from the same isoform and error corrected and polished consensus sequences using LSC-2.0 and TOFU pipeline [24][25]. By short read alignment, 39,170 of 39,364 isoforms were error corrected.

De novo transcriptome assembly and sequence clustering
Raw data of the S. tora transcriptome generated from Illumina HiSeq were preprocessed to remove nonsense sequences including adaptors, primers, and low quality sequences (Phred quality score of less than 20) using NGS QC Toolkit [26]. The raw data were further processed to remove ribosomal RNA using riboPicker v0.4.3 [27]. The preprocessed reads were then assembled using Trinity [28]. Assembly statistics were calculated using in-house Perl scripts. Assembled transcripts were clustered (CD-HIT-EST v4.6.1) [29] in order to reduce sequence redundancy. Sequence identity threshold and alignment coverage (for the shorter sequence) were both set as 90% to generate clusters. Such clustered transcripts are defined as reference transcripts in this work.

Illumina expression quantification and differential expression analysis
The cleaned reads from each tissue were aligned with the transcriptome assembly using Bow-tie2 [30]. The aligned reads were quantified as fragments per million reads (FPKMs) against non-redundant combined transcript sequences (at 90% sequence similarity by CD-HIT-EST). The reads counting of alignments was performed using RSEM (RNA-Seq by Expectation Maximization)-1.2.25 [31]. The differential expression analysis was performed using the DESeq2 packages [32] based on the unigene set of RNA-Seq. Differentially expressed genes (DEGs) were identified using the combined criteria of a more than twofold change and significance with P-value threshold of 0.001 based on the three biological replicates.

Functional annotation and classification
All the assembled unigenes were annotated by BLAST program [33] against the National Center for Biotechnology Information (NCBI) nonredundant (Nr) protein database, the Swiss-Prot protein database, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways database with an E-value cutoff of 10 −5 . The best aligning results were selected to annotate the unigenes. Whenever the aligning results from different databases conflicted, the results from Swiss-Prot database were preferentially selected, followed by Nr database and KEGG database. Functional categorization by Geno Ontology (GO) terms [34] was carried out by Blast2GO program [35] with E-value threshold of 10 −5 . AgriGO [36] was used to determine over-representation of GO categories (e.g., biological processes).

Identification of transcription factor families
To investigate the putative transcription factor families in S. tora, unigenes were mapped against all the transcription factor protein sequences made available by the Plant Transcription Factor Database (PlantTFDB 4.0; http://planttfdb.cbi.pku.edu/download.php) using BLASTX with E-value threshold of 10 −5 .

Quantitative RT-PCR analysis
Total RNA was extracted by using the RNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA) following the manufacturer's instructions. The quality of the isolated RNA was checked on ethidium bromide-stained agarose gels, and its concentration was calculated according to the measured optical density (OD) of the samples at 260 and 280 nm (DropSense96C Spectrophotometer, Trinean, Belgium). The 1 μg of the total RNA was used for the cDNA synthesis using SuperScript TM III first strand RT-PCR kit (Invitrogen, Carlsbad, CA, USA) with an oligo (dT) 20 primer. After cDNA was obtained from S. tora, qRT-PCR was performed using genespecific primers (S1 Table). Real-time PCR analysis was optimized and performed using the Roche LightCycler 1 480 II instrument and SYBR 1 Green Real-Time PCR Master Mix (Bio-Rad, InS., Hercules, CA, USA) under condition of an initial denaturation at 95˚C for 30 s followed by 40 cycles of denaturation at 95˚C for 10 s, annealing and extending at 55˚C for 15 s. The relative expression of specific genes was quantified using the 2 -ΔΔCt calculation according to the manufacturer's software [37] (where ΔΔC t is the difference in the threshold cycles), and the internal reference gene was the elongation factor 2 for data normalization. Reliability of the amplification parameters was analyzed at 1:15 dilutions of the cDNA samples. The mean threshold cycle values for the genes of interest were calculated from three experimental replicates.

Data availability
The RNA-Seq and Iso-Seq sequences generated from Illumina and PacBio RS II sequencing of four tissue samples of S. tora were deposited at the National Center for Biotechnology Information (NCBI) Sequence Read Archive database with the accession number SRP159435.

RNA sequencing and de novo transcriptome assembly
De novo transcriptome analysis is a good tool for generating the overall genetic information of an organism without full genome sequencing and leads to discoveries of new genes, molecular markers, and tissue-specific expression patterns. We used the Illumina HiSeq 2500 system and PacBio RS II platform to sequence the cDNA libraries of the leaf, root, and early-and latestages of seed for elucidating secondary metabolites biosynthesis and understanding their spatial and temporal expression pattern in S. tora. Illumina Hiseq 2500 sequencing platform produced 278,031,495 raw reads and averaged 23,169,291 reads per tissue (S2 Table). In total, more than 270 million reads showed high quality read rates (Q30 values) of over 88.00% (S2 Table). The Trinity assembler from the four different libraries generated a total of 118,635 unigenes that were more than 300 base pairs (bp) long (Fig 1). The length of the transcripts varied from 300 to 18,622 bp with an average length of 832.25 bp, the N50 length of 1,082 bp, and the GC content of 39.51% (Table 1). The overall strategy is schematically represented in S1 Fig. A unigene, the assembled transcript that represents a hypothetical gene, can be represented by several isomers as different forms of the same protein. The PacBio RS II sequencing platform produced 768,745 raw reads. After classification and clustering, 118,703 high-quality isoforms were obtained from three different libraries, which contained 39,672, 32,954, and 46,077 high-quality isoforms per library sizes (<2 kb, 2-3 kb, and >3 kb) (S3 Table). The 118,703 high-quality isoforms from three different libraries generated 39,364 non-redundant unigenes after the CD-HIT-EST program removed redundant isoforms. Then, 39,170 of 39,364 unigenes were error corrected by short read alignment. The total size of the assembly was 112 MB with 57% of transcripts larger than 500 bp and 12% larger than 2,000 bp. In total, our analysis  generated two unigene sets: 118,635 from RNA-Seq and 39,364 from Iso-Seq (Fig 1). The two unigene sets showed similar GC contents. However, overall unigene lengths of each set showed that the length of the Iso-Seq was longer than RNA-Seq. Unigenes obtained from Iso-Seq were better in terms of minimum length, average length, and N50 length (Table 1). In our analyses, we used the Iso-Seq unigene set mainly as a reference for RNA-Seq data. Due to other dissimilar characteristics, such as the transcript length between the RNA-Seq and Iso-Seq gene sets, this study did not constitute an integrated unigene set. Later, we plan to create one using the reference-guided method when the S. tora genome sequencing is completed.

Functional annotation and classification
Annotation of function is required to characterize transcripts and understand the complexity and diversity of an organism. For the functional annotation, the assembled 118,635 unigenes obtained from RNA-Seq of leaf, root, early seed, and late seed tissue samples were screened using an FPKM criterion of � 1, which resulted in 56,707 unigenes. To obtain the best annotations, assembled 56,707 RNA-Seq unigene sets and 39,364 Iso-Seq unigene sets of S. tora were aligned with four public protein databases. We used the BLASTX program against NCBI Nr, Swiss-Prot, KEGG, and GO protein databases with an E-value threshold of 1e-5. Annotations of RNA-Seq and Iso-Seq unigenes resulted in the identification of 43,286 and 36,882 unigene sets, which were respectively matched with known proteins. The Venn diagram displays the unique best BLASTX hits from NCBI Nr, Swiss-Prot, KEGG, and GO databases (S2 Fig). The overlapping regions of the four circles indicate the number of unigenes sharing BLASTX similarities in respective databases. The Venn diagram of RNA-Seq showed significant matches: 32,469 to Swiss-Prot (75.01%), 42,552 to NCBI Nr (98.30%), 3,279 to KEGG (7.58%), and 30,287 to GO terms (69.97%). So did the Venn diagram of Iso-Seq: 30,626 to Swiss-Prot (83.04%), 36,830 to NCBI Nr (99.86%), 6,441 to KEGG (17.46%), and 26,762 to GO terms (72.56%). In summary, 43,286 RNA-Seq and 36,882 Iso-Seq unigene sets had at least one significant protein match to these databases. The pattern of annotation of RNA-Seq and Iso-Seq showed that the Iso-Seq is better than RNA-Seq at annotating essential data. Non-significant genes that may represent new genes, non-coding RNA, or RNA representing unnecessary information is not evaluated in this annotaion, and further analysis is required. Matches to the Nr database also indicated that a large number of the S. tora unigenes closely matched the sequences of Glycine max (26.94%), Glycine soja (13.07%), Vigna radiate var. radiata (3.21%), Cicer arietinum (9.38%), and Phaseolus vulgaris (5.63%). Unigenes of 15 species in the Nr database had > 1% match with those of S. tora (S3 Fig).
To further functionally characterize the S. tora transcriptome, we classified the functions of RNA-Seq and Iso-Seq unigenes using GO analysis. The distribution of RNA-Seq and Iso-Seq unigene sets in different GO categories is shown in Fig 2. The three main categories of GO annotations of RNA-Seq included 26,616 GO terms (42.12%) for biological process, 20,211 terms (31.98%) for molecular function, and 16,365 terms (25.90%) for cellular component. Among biological process, organic substance metabolic process (17.00%) and primary metabolic process (16.00%) were the most abundant GO categories. Regarding molecular function, GO terms related to organic cyclic compound binding (19.00%) and heterocylic compound binding (19.00%) were the most abundant, while cell part (22.00%) and cell (22.00%) were the mostly represented GO categories in cellular components. Conversely, the three main categories of GO annotation of Iso-Seq include 57,137 GO terms (45.64%) for biological process, 31,562 terms (25.13%) for molecular function, and 36,876 terms (29.37%) for cellular component. Among biological process, organic substance metabolic process (16.00%) and primary metabolic process (16.00%) were the most abundant GO categories of biological process. The GO terms related to nucleotide binding (16.00%) and nucleoside phosphate binding (16.00%) were the most abundant in molecular function categories. Also, the most abundant GO categories in cellular component were cell part (24.00%) and cell (24.00%). GO terms pattern of RNA-Seq and Iso-Seq was similar in patterns.
Transcription factor (TF) families, including ARF, bHLH, bZIP, C2H2, ERF, MIKC, MYB, NAC, and WRKY, play a key regulatory role in the expression of genes, which are involved in plant secondary metabolism and response to environmental stress, by binding to specific cisregulatory elements of the promoter regions. The number of genes encoding for different TF families varies in different plants to perform species-/tissue-specific or developmental stagespecific function [38]. In our study, 3,284 RNA-Seq and 3,576 Iso-Seq were generated with a total of 6,860 unigenes assigned to 56 TF families. Among these, bHLH (521, 15.86%) were found to be the most abundant in RNA-Seq followed by WRKY (243, 7.40%), C2H2 (189, 5.76%), MYB (177, 5.39%), bZTP (170, 5.18%), and NAC (150, 4.57%). Similarly, in the Iso- Seq, bHLH were found to be the most abundant followed by WRKY, but the other TF families showed a slight ranking change (S4A Fig). Expression of the gene varies depending on the environment in which each species is exposed, and specific or large amounts of the gene are expressed. The degree of expression of the TF family, which mediates and controls their expression, is essential for the molecular genetics of organisms, so in order to investigate tissue specific gene expression in S. tora we studied the expression of genes in leaf, root and early and late seed tissues. Interestingly, different expression patterns for TFs were observed in four tissues of S. tora. Some TFs were unique to each tissue, whereas others were enriched in respective tissues. The 35 and 98 TFs among a total of 133 TFs expressed in leaf, 41 and 97 from 138 TFs in root, 30 and 51 from 81 TFs in early-stage, and 15 and 18 from 33 TFs in late-stage during seed development were tissueenriched and -specific (S4B Fig). Notably, growth regulating factor (GRF) in the TF family was dominantly expressed in late-stage seed tissue (S4 Table). GRFs are plant-specific transcription factors that were originally identified for their roles in stem and leaf development [39]. However, recent studies highlight its importance in other central developmental processes including root development, flower, and seed formation. Expression of GRFs has also been observed in various rice and maize tissues, suggesting their involvement in seed development [40,41].

Differential gene expression analysis during seed development
To compare genes of S. tora with differential expression level in late-stage seed development to those in early-stage development, we used the DESeq method. The transcripts with log2 fold change (FC) >1 and p-value < 1e-3 were considered as differentially expressed genes (DEGs). Pair-wise comparison of transcripts between early-and late-stages of seed development resulted in a total of 14,825 DEGs in RNA-Seq. As seeds matured, 4,935 genes were identified as up-regulated and 9,890 genes were down-regulated. These genes belong to diverse functional groups including glycosyl hydrolases, dehydrogenases, transferases, kinases, phosphatases, cytochrome P450, oxygenases, and hormone-responsive proteins. A heat map was constructed to cluster the top 50 DEGs based on the similarity and diversity of expression profiles using normalized FPKM values within a range of 6 to 16 (Fig 3). Specifically, transcripts of various proteins are expressed differently depending on the tissue and stage of seed. In early-stage seeds, the expression of chalcone synthase, peroxidase, and cell wall/vacuolar inhibitor of fructosidase were higher than those of late-stage seeds. In particular, C/VIF releases glucose and fructose in irreversible reactions, which is essential to plant growth, storage compound accumulation, and stress response [42]. Conversely, in late-seed development, late embryogenesis-abundant (LEA) proteins and heat shock proteins (HSPs) appeared to be more abundant than early seeds like the adlay species [43].
Previously, the expression of genes in leaves, roots, and early-and late-seed tissues were examined to investigate the tissue-specific gene expression of S. tora. During this process the transcripts exhibiting tissue-specific expression were identified and the top 10 transcripts were selected (S5 Fig). Real-time PCR analysis was performed in order to accurately identify differential expression of selected transcripts in the data. Expression analysis was carried out from the selected genes belonging to carbohydrate mechanism, the secondary metabolite pathway, and the associated transcription factors (Fig 4). These results were consistent with tissue-specific gene expression data in various tissues. As results, 3 genes were identified in the qRT-PCR of the seeds to be specifically expressed compared to other tissues. Cell wall/vacuolar inhibitor of fructosidase 2(C/VIF2) play important roles in carbohydrate metabolism, stress responses, and sugar signaling. The specific expression of C/VIF2 in early seeds is implicated in several mechanisms of maturation. Cytochrome P450 83B1 genes showed the highest expression levels in leaf, followed by root, late seed, and early seed. Cytochrome P450 83B1 protein is known to be involved in auxin homeostasis and glucosinolate biosynthesis associated with plant growth and pathogenic responds [44]. Also, seed biotin-containing protein gene showed the highest expression levels in late seed, followed by early seed, demonstrating that the protein plays an important role in the developmental stage of the seed. And organic-cation/carnitine transporter 1 protein gene expressed high levels in root, followed by leaf and late seed. Organic-cation/carnitine transporter families are generally characterized as polyspecific transporters involved in the homeostasis of solutes in animals [45]. Although some publications have suggested that this protein is known as stress-regulated member of plants and that it is involved in plant growth [45], little is known about the function, localization, and regulation of plants. To determine the biological function of DEGs during seed development, GO classification analysis was carried out using Blast2GO. The results showed that 25 functional groups, including 3 major ontologies, classified 63,192 GO terms annotated by the GO database: biological process, cellular component, and molecular function. Many of these DEGs were dominant catalytic activity, binding metabolism, cellular processes, cell parts and cells (S6 Fig). In confirming whether there is specificity for development of seeds in relation to their transcripts, orthologous S. tora genes were applied to gene ontology enrichment analysis using the AgriGO program. In molecular function of GO ontologies, the level of binding function was increased in the up-regulated DEGs. Among them, RNA binding increased to a very high level. In addition, down-regulated DEGs showed an increase in the catalytic activity function, and they also increased protein kinase activity, transferase activity, and microtubule motor activity (S6 Fig). To identify specific metabolic pathways that are responsible for the transcriptional changes of enzymatic genes during seed development of S. tora, we performed MapMan analysis with the expression data of genes showing at least 2-fold differential expression between seed developmental stages. We made the figure to depict the biological processes of interest, and display log2-normalized expression counts onto pictorial diagrams. Most of the genes in cell metabolism are involved in cell wall metabolism, lipid metabolism, carbohydrate metabolism, and secondary metabolism. The dynamic changes in metabolic pathways during seed development were provided in Fig 5, in which we identified the downward trend of overall transcription in the seed development process. In particular, it was clear that lipid metabolism, precursor synthesis, flavonoid metabolisms, and phenylpropanoids/phenolics metabolisms were down-regulated, while the FA synthesis of lipid metabolism and the N-msc of secondary metabolism were up-regulated.

Candidate gene families involved in anthraquinones biosynthesis
S. tora is well known for its various therapeutic effects (e.g., for its anti-hypertensive, diuretic, anti-cancer, anti-microbial and cholesterol-lowering effects). Each effect is caused by various secondary metabolites produced in S. tora, the best known of these being anthraquinone. The biosynthesis of anthraquinone shares isochorismate pathways with phenylpropanoid and shares MEP/DOXP, MEV, and shikimate pathways with carotenoid and flavonoid. In addition, the polyketide pathway is an important part of the anthraquinone biosynthesis. To analyze the active biosynthesis of anthraquinones, we determined the contents of seven compounds of the anthraquinone biosynthesis pathway in early-and late-seed tissues. As seeds matured, anthraquinone compounds were more accumulated in late seed than early seed ( Fig  6 and Table 2). Among the seven compounds, gluco-obtusifolin has the highest content in seed tissues (Fig 6 and Table 2). It is well known that aurantio-obtusin is the most significant active compound [8] and is distributed mainly in the seed [46]. However, we found that low levels of aurantio-obtusin were observed at the early and late developmental stages. A possible explanation for this reason is that aurantio-obtusin may accumulate mainly in the matured and/or dry seed.
To observe gene expression levels of each parts and to compare the changes in gene expression levels between different parts, their levels were normalized to the FPKM (reads per kilobase of exon model per million mapped reads), and transcripts were hierarchically clustered based on the Log2(FPKM+1), allowing us to observe the overall gene expression pattern (Fig  7). In our study, there were 337 RNA-Seq and 212 Iso-Seq genes involved in S. tora secondary metabolites, and they were classified into five pathways including the MEP/DOXP, MEV, shikimate, carotenoid, and flavonoid/polyketide (Fig 7 and S5 Table). There were 35 RNA-Seq and 24 Iso-Seq genes in S. tora for seven enzymes involved in MEP/DOXP pathway and mevalonate pathway leading to production of precursor dimethylallyl disphosphate (Fig 7 and S5  Table). They are also involved in the shikimate pathway leading to the production of precursor 1,4-dihydroxy-2-napthoyl-CoA including 40 RNA-Seq and 31 Iso-Seq genes for 9 enzymes (DAHPS, DHQS, DHQD/SDH, SMK, EPSP, CS, ICS, MenE, and MenB). In MEP/DOXP, 13 DXPS (1-deoxy-D -xylulose-5-phosphate synthase, EC 2.2.1.7) were expressed in anthraquinone synthesis. In them, DN49358_C0_g1 was expressed in large amounts up to the early stage of seed, but appeared to be greatly reduced by the late stage. This gene was also expressed at high levels in leaf and root tissues. Furthermore, DN27315_c0_g1 demonstrated higher levels of gene expression in leaf than in other tissues. And only three of the 13 DXPS genes showed high levels of expression independent of tissue and seed development. ISPD, CDPMEK, and ISPF genes were identified in only 1 and 2, while HDS and HDR were identified in more frequent. HDS and HDR were identified in genes 8 and 6, and HDS ((E)-4-hydroxy-  Anthraquinones are also known to be produced from acetyl-CoA and malonyl-CoA through polyketide pathway in plants. Chalcone synthase (CHS), a type III polyketide synthase, is an important enzyme involved in the polyketide pathway [47]. We have identified 27 RNA-Seq and 23 Iso-Seq genes encoding for enzyme involved in type III polyketide synthase (S5 Table). As a ubiquitous enzyme in higher plants, CHS is known to produce flavonoids by catalyzing the sequential decarboxylative reaction with 3 malonyl-CoA and p-coumaroyl-CoA as a starter and extender unit, respectively [48]. It was also suggested that polyketide synthase could form an anthraquinone precursor using acetyl-CoA and malonyl-CoA. And the formed precursor, octaketide is cyclized by PKC-encoding polyketide cyclase, and usually forms three-ring structures named A, B, and C rings [49]. The formed intermediate is modified by P450 to produce anthraquinone or emodin anthrone, and also to produce sennoside by modification of glycosyltransferases. These 27 PKS gene sizes averaged 584.03 bp, and the longest was 1,580 bp. Among them, only 3 genes (DN50459_c0_g1, DN2403_c0_ g2, and DN50459_c0_g2) showed high levels of expression change in seed development. It seems that these genes are changing a lot in order to make the backbones of the flavonoid and carotenoid components needed for survival in the later stages of seed development. In particular, 5 genes (DN17347_c0_g1, DN50624_c4_g3, DN69520_c0_g1, DN50624_c4_g1, and DN50624_c4_g4) showed a large amount of expression in the early part of the seed, whereas in the latter part, the level of expression decreased sharply, suggesting that those genes play a very important role in the biosynthesis of the backbone of the material needed in early seed development.
In general, glycosylation is carried out at the end of secondary metabolites biosynthesis and improve the solubility and stability of the secondary metabolites. In nature, UDP-glycosyltransferases (UGT) normally facilitates glycosylation, and makes the natural product with glucose at the hydroxyl group [50]. In our study, there were 59 genes in seed stage of S. tora. Based on the results, 33 out of 59 genes showed more expression at the late-seed than at the early-seed stage, whereas 26 showed more expression at the early-seed stage (Fig 7 and S5  Table). The degree of expression of the seven genes (DN131354_c0_g1, DN67413_c0_g1, DN49988_c0_g2, DN50503_c0_g2, DN82643_c0_g1, DN17331_c0_g2, and DN137099_c0_ g1) seems to increase rapidly during the growth of the seed, which seems to be necessary for the process of stockpiling the energy required for seed germination. In addition, DN17331_ c0_g2 and DN82643_c0_g1 seem to have a great effect on the glycosylation during seed development because they undergo a significant amount of change. Conversely, the expression level of the four genes (DN50189_c2_g1, DN11235_c0_g1, DN62590_c0_g1, and DN76515_c0_g1) seemed to decrease rapidly, and the remaining 22 genes were found to be expressed with a relatively small decrease.
Supporting information S1 Table. Gene-specific primers used for tissue-specific qRT-PCR.