Transcriptomic Analysis of American Ginseng Seeds during the Dormancy Release Process by RNA-Seq

American ginseng (Panax quinquefolius L.) is an important herb that is cultivated in China, North American, and South Korea. It is propagated from seed, but the seed has deep dormancy characteristics described as morphophysiological dormancy. Two-stage temperature stratification, a warm (15–20°C) and cold (2°C) stratification period of 6 months, has been used successfully for seed dormancy release. However, little is known about the molecular mechanisms of seed dormancy release in the stratification process. In this study, seed development after pollination and seed development in the dormancy release process were investigated in American ginseng. The transcriptome during seed dormancy release was analyzed using RNA-Seq technology and 78,207 unigenes (mean length 531 bp) were generated. Based on similarity searches of public databases, 54,292 of the unigenes (69.4%) were functionally annotated. Further, three digital gene expression (DGE) libraries were sequenced and differences in gene expression at three stages during seed cold stratification were examined. The greatest number of differentially expressed genes occurred in the 90DCS versus 180DCS libraries, while the lowest number of differentially expressed genes occurred in the 135DCS verus 180DCS libraries. GO enrichment analysis revealed that 59, 29, and 39 GO terms were significantly enriched in the biological process, molecular function, and cell component GO categories, respectively. There were 25,190 genes with KEGG pathway annotation in the three DGE libraries and their enrichment pathways were compared. The gene expressions of 30 selected unigenes were validated using quantitative PCR. This study is the first to provide the transcriptome sequences for seed dormancy release in American ginseng, and demonstrates the successful use of DGE profiling data for analyzing transcriptomic variation during dormancy release. These data provide a basis for future researches of seed dormancy in morphophysiological dormancy seeds in non-model plants.


Introduction
American ginseng (Panax quinquefolius L.), one of the most widely used medicinal plants of the Araliaceae family, has been introduced and cultivated in China for over forty years.It is propagated from seed.The embryo in the newly harvested seed is not fully developed (average length 0.5mm), and needs a warm (15-20°C) and cold (2°C) stratification period of 6 to 18 months [1,2,3].Based on the Baskin and Baskin classification theory of seed dormancy [4], ginseng seeds belong to the morphophysiological dormancy (MPD) class, which describes a combination of morphological and physiological dormancy.Many factors affect MPD seed germination, including the stratification period, seeding time and depth, temperature, and spacing [1,5].At present, two methods have been used for seed treatment.One is a conventional method in which the newly harvested seeds are buried in sand outdoors for 18 to 22 months under seasonal changes of temperature.The other method is a two-stage process: a warm stratification stage during which the embryo begins to grow, and a cold stage during which the embryo completes its development and the endogenous dormancy of the seed is overcome.Thomas et al. suggested that it was possible to induce germination by manipulating ginseng seeds through three months each of warm (15°C) and cold (2°C) temperatures [1].So far, although the two-stage ginseng seed stratification technology has been widely used [1,6,7], the internal molecular mechanism of seed dormancy release is still unclear.
Mining the functional genes, especially those that encode the key enzymes involved in ginsenoside biosynthesis in American ginseng has become a focus of attention in recent years.Chen et al. [8] have generated ESTs (expressed sequence tags) that will be useful for studying functional genomics in American ginseng, and which can be applied to molecular modification of the ginsenoside biosynthetic pathway.Searching public databases is another method that has been used to identify functional genes for cloning [9].Over the last few years, high-throughput sequencing, such as 454-sequencing and RNA-sequencing (RNA-Seq), has provided new strategies for analyzing the functional complexity of transcriptomes and the sequencing data that are now available have opened up opportunities for mining functional genes in non-model plants such as American ginseng.Sun et al. and Wu et al. obtained many unigenes by 454 sequencing and identified all the known enzymes involved in ginsenoside backbone synthesis [10][11][12].Qi et al. analyzed the transcriptome data obtained using 454-sequencing in a type of MPD Paris polyphylla seeds [13].RNA-Seq technology based on pyrosequencing is a popular and powerful tool for species that lack reference genome information.RNA-Seq is less costly, more efficient, and allows faster gene discovery and more sensitive and accurate profiles of the transcriptome than microarray analysis or other techniques [13][14][15][16][17][18].To better understand the molecular mechanisms of seed dormancy in ginseng seeds, we used RNA-Seq technology to identify and characterize the expression of a large number of genes, especially those differentially expressed during dormancy release.
The aim of this work was to gain an understanding of the molecular mechanisms associated with seed dormancy release in American ginseng and establish a sound foundation for future molecular studies.For seed transcriptome sequencing, four seed samples were taken to form a sequencing library, two samples at warm stratification (no split and split seeds) and two samples at cold stratification (135 days and 180 days during the cold treatment).For the digital gene expression (DGE) analysis, three seed samples were taken at 90 days (90DAS), 135 days (135DAS) and 180 days (180DAS) after the warm stratification.We sequenced cDNA libraries from American ginseng seed during the dormancy release process and analyzed the expression of different genes in seeds during cold stratification to identify dormancy-related genes with the aim of constructing a database for this species.To our knowledge, this is the first report on the transcriptome profiling of American ginseng seed dormancy using RNA-Seq technology.

Plant materials
American ginseng (Panax quinquefolius L.) was cultivated in an experimental field in the Institute of Medicinal Plant Development, Beijing, China.The flowers were labeled to make it easy to calculate the date after pollination.After pollination, the ovules used to observe embryo development were sampled every seven days for a total of eleven times, therefore, sampling lasted for 77 days.The ovules were fixed using FAA (95% alcohol: acetic acid: formalin: water = 10:1:2:7) and made into tablets for microscopic inspection.Two-stage seed treatment method was employed for dormancy release.American ginseng seeds were stratified for 3 months in a sandbox at a warm temperature (15°C) during which time most of the seeds completed their anatomical development (split seeds), and then cold (2°C) stratification was applied for another 3 months for physiological dormancy release.For transcriptome sequencing, four seed samples were taken at two seed treatment periods, that is, two samples at warm stratification (no split and split seeds) and two samples on Day 135 and 180 during the cold stratification.For the digital gene expression (DGE) analysis, three seed samples were taken separately on Day90, Day 135 and Day 180 during the cold stratification.

RNA extraction, library preparation, and RNA-seq
The seeds were sampled from three biological replications at each stage to produce an independent pool.Total RNA was extracted from seed samples using an RNA Extract kit (BioTeke, Beijing, China).To construct the transcriptome library, the RNA from the four seed samples described above was pooled by mixing in equal quantities.The three DGE libraries consisted of separate RNA extracts from the three cold stratification samples.Each library was pooled by mixing equal quantities of RNA from the three biological replications for each stage.Each pool was sequenced once because RNA-seq data are highly replicable with relatively little technical variation.RNA integrity was confirmed using a 2100 Bioanalyzer (Agilent Technologies Santa Clara, CA, USA).Oligo-(dT) magnetic beads were used to isolate poly-(A) mRNA from total RNA, and mRNA was fragmented in fragmentation buffer.Using these short fragments (200 bp) as templates, random hexamer primers were used to synthesize first-strand cDNA.Second-strand cDNA was synthesized using buffer, dNTPs, RNaseH, and DNA polymerase I. Short double-stranded cDNA fragments were purified with a QiaQuick PCR extraction kit (Qiagen, Venlo, The Netherlands), resolved with EB buffer for end reparation and adding poly (A), then ligated to sequencing adapters.After purification via agarose gel electrophoresis, suitable fragments were enriched by PCR amplification before library sequencing on an Illumina HiSeq 2000 platform (San Diego, CA, USA).

De novo assembly and function annotation
The raw reads were saved in fastq format, and the dirty raw reads were removed prior to analyzing the data.Three criteria were used to filter out dirty raw reads: adaptors, more than 5% "N" bases, low-quality reads with more than 20% of the bases having a quality value 10.All subsequent analyses were based on clean reads.De novo assembly of the clean reads was performed using Trinity (release 20110713) [19].The command-line parameters were as follows:-seqType fq,-min_contig_length 100,-min_glue 3,-group paris_distance 250,-path_reinforcement_distance 85, and-min_kmer_cov 2. In this paper, the assembled reads are referred to as unigenes.All non-redundant transcripts were used in searches against the NR, Swiss-Prot, KEGG, and COG databases using the BLASTALL package with the significant threshold E-value 10 −5 .The annotation for each known gene from the best BLASTX hit was parsed and assigned to the corresponding transcript.For transcripts with no hits to any of the searched databases, ESTScan [20] was used to predict the coding region and decide the sequence direction.Blast2GO [21] and WEGO [22] were used to assign GO annotations to the transcripts for functional classification.

DGE library preparation, sequencing, and bioinformatics analysis
Total RNA \from each of the three cold stratification seed stages, 90DAS, 135DAS, and 180DAS, were used to construct three tag libraries.The libraries were prepared in parallel for DGE sequencing using the Digital Gene Expression Tag Profile Kit (Illumina).The detailed process was as described by Meng et al. [23].Reads with adaptors, reads in which unknown bases were more than 10%, and low quality reads (percentage of the low quality bases (quality value <5) was more than 50% in one read) were removed.The clean reads were mapped to the transcriptome unigene sequences using the SOAPaligner/soap2 (version 2.20) alignment program, allowing a maximum of 2-bp mismatches per read.To estimate the expression levels of the transcripts, the number of uniquely mapped reads for each unigene for the three cold treatment stages was competed and normalized to RPKM values (reads per kilobase per million mapped reads) using the RPKM formula described by Mortazavi et al. [24].The longest mapped read was used to calculate the expression level and coverage when multiple unigenes represented one gene.For genes with missing values in a specific sample the RPKMs were adjusted to 0.001.The RPKM values were then compared pairwise as: 90DAS/135DAS, 90DAS/ 180DAS, and 135DAS/180DAS.Enriched P-values were calculated using the formula described by Audic and Claverie [15], and Liu et al. [16].The P value was used to identify genes expressed differentially between the paired treatments.Significantly DGEs were identified using a FDR (false discovery rate) threshold of 0.001 and a minimum two-fold change.Based on log2transformed RPKM values, the expression patterns of 22,832 DGEs were clustered in Genesis (1.7.6) (http://genome.tugraz.at)using the K-means algorithm and Pearson's correlation distance.

qPCR analysis
Total RNA was isolated from seed, embryo, endosperm, root, stem, and leaf of American ginseng using the RNeasy Plant Kit (BioTeke).Approximately 1 μg of DNase I-treated total RNA from each tissue was converted into single-stranded cDNA using a PrimeScript RT reagent Kit (Takara, Dalian, China).The cDNA products were then diluted 10-fold with deionized water before use as templates for RT-PCR.
The RT-PCR analyses were performed three times with independent RNA samples.Primers for 30 of the transcriptome unigenes were used in the PCR reaction (S1 Table ).18S RNA gene was used as an internal control to assess the differential regulation of the genes.The semiquantitative PCR reaction was performed in a total volume of 25 μl containing 10 pmol of the gene-specific primers, 20 μM of dNTPs mix, 1.5 mM MgCl2, 1× Taq polymerase buffer, and 1 U Taq polymerase (Takara, DaLian, China).The thermal cycling conditions were as follows: initial denaturation at 94°C for 5 min, then 35 cycles of denaturation at 94°C for 30 sec, re-annealing at 55°C for 30 sec, and elongation at 72°C for 30 sec.A final elongation was performed at 72°C for 5 min.Amplicons were separated on 2% agarose gel and stained with ethidium bromide and then documented on a gel-imager.For real-time PCR analysis, each reaction contained 10μl 2×SYBR Premix DimerEraser (Takara, Dalian, China), 1μM each of the forward and reverse primers, and 1μl of template cDNA.The total reaction volum was 20μl.PCR amplification was performed under the following condition: 95°C for 30 sec, followed by 40 cycles of 95°C for 5 sec and 60°C for 30 sec, using CFX96 TM Real-Time system (Bio-Rad, USA).The specificity of real-time PCR primers was confirmed by melting curve (S1 Table ).Relative expression was calculated as 2 -ΔΔCT and normalized to that of the actin gene [25].The real-time PCR analysis were performed three times with independent RNA samples.

Seed development and stratification
The P. quinquefolius zygote was still in a state of rest at 78 days after pollination (DAP).The proembryo was formed gradually after a first asymmetric division at about 10-14 DAP, and the early globular embryo was seen at 21-28 DAP.At 38 DAP, the globular embryo began transfer to the triangular stage embryo and the suspensor began to degenerate; at the same time, the endosperm became fully developed.Then, from 38-58 DAP, the leaf primordium, the heart stage embryo, and the early torpedo stage embryo gradually formed.At the time of harvesting the seeds, the embryos had stopped developing and were in the late torpedo stage at about 68-78 DAP.The brief development stage of an American ginseng seed after pollination is shown in Fig. 1 (a-j).It can be seen that the seed is not fully developed.For anatomical development, the seed needs a warm stratification period after which cotyledons, hypocotyls, radicals, and epicotyls become visible (Fig. 1 (k-o)).As suggested earlier by Thomas et al. [1,2], we performed a warm stratification of about 3-4 month for morphological dormancy release and then a cold stratification of about 3-6 month to overcome endogenous dormancy.

RNA-Seq and de novo assembly
To obtain an overview of the seed transcriptome during the dormancy release process, a cDNA library was generated from RNA isolated from four seed treatments including anatomical development and physiological after-ripening stages, and paired-end sequenced using the Illumina platform.After cleaning and quality checks, approximately 25.86 million clean reads were assembled into 148,244 contigs (Table 1).The mean contig size was 275 bp.Using pairedend joining and gap-filling, these contig were further assembled into 78,207 unique sequences with a mean size of 531 bp.The size distributions of the contigs and unigenes are shown in Fig. 2. The raw transcriptome reads were deposited in the National Center for Biotechnology Information (NCBI) Short Read Archive (accession number: SRR1586196)

Annotation of the P. quinquefolius seed transcriptome
Approximately 54,292 unique sequences were annotated using BLASTX searches (cut-off Evalue 10 −5 ) against public databases: NCBI's non-redundant protein sequence (nr) and nucleotide sequence (nt) databases, Swiss-Prot protein database, Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO) database, and clusters of orthologous groups (COG).A summary of the transcriptome functional annotation is given in Table 2.A total of 23,915 unigenes (30.6%) did not find matches in any of the databases.

Functional classification
We used GO, COG, and KEGG assignments to classify the functions of the predicted unigenes.

DGE library sequencing and mapping to the reference transcriptome database
Three seed sample DGE libraries (90DAS, 135DAS and 180DAS) were sequenced using tRNA-Seq and 8,424,608, 10,034,797, and 8,710,395 clean reads, respectively, were obtained after filtering out the raw reads (Table 3).Among them, 76.06%, 75.16%, and 78.60% of the reads in 90DAS, 135DAS, and 180DAS, respectively, mapped to reference genes and more than 50% of the reads had perfect matches and unique matches to reference genes [29] (Table 3).Sequencing quality was evaluated using saturation analysis and randomness assessment (S1 Fig. ).
For mRNA expression, heterogeneity and redundancy are two significant characteristics.While the majority of mRNAs are expressed at low levels, a small proportion of mRNAs are highly expressed.Therefore, the distribution of unique reads was used to evaluate the normality of the RNA-Seq transcriptome data.As shown in S2 Fig., the distribution of unique reads over the different reads abundance categories showed similar patterns for all three RNA-Seq libraries.The similarity distribution showed that more than 32% of the sequences had a similarity of 80%, while approximately 50% of the hits had a similar range.

Gene expression profiles among three seed dormancy stages
Differences in gene expression among the three seed dormancy stages during the cold stratification were examined, and differentially expressed genes (DEGs) were identified by the pairwise comparisons: 90DAS-VS-135DAS, 90DAS-VS-180DAS, and 135DAS-VS-180DAS.Fig. 5 (a) and (b) showed the distribution of the total number of DEGs up-or down-regulated between two compared stages and their relationship among the different comparisons.3635 and 3037 DEGs were found to be up-and down-regulated separately in the 135DAS to 90DAS comparison.A similar number of DEGs were differentially expressed between 180DAS and 90DAS.The comparison of 180DAS and 135DAS identified 2096 enhanced DEGs and 1774 decreased DEGs, which were less than the former two comparisons, indicating less genes showed differential expression through 135DAS to 180DAS.Only the expression abundance of 368 and 182 DEGs were significantly increased or decreased by at least 2 folds throughout the  cold stratification, indicating that most transcripts were transiently significantly regulated by the switches of stratification process.For example, of 3635 enhanced Unigenes in 135DAS to 90DAS comparison, 1224 Unigenes were increased only in 135DAS, and 2043 Unigenes were overlapped with those in 180DAS to 90DAS comparison, indicating that they still remained higher expression levels in 180DAS after the increase since 135DAS, but no quantitative difference from 135DAS.These differences in gene expression are consistent with the seed dormancy release process.Some reports have shown that gene activity was lowest in the early stages of dormancy and peaked around the time of dormancy break [24,30].In the 90DAS, 135DAS and 180DAS libraries, 21.38%, 17.67%, and 25.58% genes, respectively, found no matches in the nr or GO databases.These genes may be candidates for future investigations.

Functional classification of DEGs during seed dormancy release
We used GO and KEGG assignments to classify the functions of the DEGs identified in the pairwise comparisons of the 90DAS, 135DAS and 180DAS cDNA libraries during seed dormancy release.Of the assigned GO terms, 59, 29, and 39 GO terms were significantly enriched in the cellular component, molecular function, and biological process GO categories respectively (S4  ).Of the genes associated with the significantly enriched KEGG pathways, those involved in steroid biosynthesis, flavonoid biosynthesis, flavone and flavonol biosynthesis, benzoxazinoid biosynthesis, starch and sucrose metabolism, and zeatin biosynthesis might be important candidate genes for American ginseng seeds dormancy release and worthy of further study.Some of the DEGs assigned to pathways that were not significantly enriched, such as indole alkaloid biosynthesis, carotenoid biosynthesis, circadian rhythm, brassinosteroid biosynthesis, and plant hormone signal transduction could also be of interest.

Expression profiles of GA, ABA metabolism and other related genes
Many reports have shown that stratification can lead to increased expression of the gibberellic acid (GA) biosynthesis genes GA20ox and GA3ox, but deceased expression of the GA catabolic gene GA2ox in Arabidopsis seeds [31,32].Additional studies have identified specific genes correlated with dormancy maintenance (NCED, ZEP (zeaxanthin epoxidase) and ABI (ABA insensitive)) and dormancy release via ABA catabolism (CYP707A) [33][34][35].In this work, we evaluated the DGE library using 30 unigenes including those that were associated with GA and ABA, as well as DELLA (a negative regulator of GA signal), ACC (1-aminocyclopropane-1carboxylate synthase), GIGATEA, PICKLE (CHD3-type chromatin-remodeling factor PICK-LE), KAO (Ent-kaurenoic acid oxidase), and CTR(serine/threonine-protein kinase CTR1) and others [36][37][38].For the semi-quantitative PCR analysis, 41 primers were designed for the 30 unigenes (the primers are listed in S1 Table ).Twenty-three of the primers successful amplified 23 of the 30 selected unigenes (Fig. 6).The PCR data for 19 of the 23 amplified (the exceptions were GA2ox2, GA2ox3, GA3ox2, and GA20ox1) were basically consistent with the RNA-Seq data for the 90DAS, 135DAS and 180DAS samples (Table 4).We also analyzed the expression levels of these 23 unigenes on Day 0 and 45 of the warm stratification and in three control tissues (root, stem, and leaf).The results showed that almost all unigenes had no or weak expression in the early stage (0 days) of seed stratification.
Eight genes were successfully performed in real-time PCR.Fig. 7 showed that GA3ox4, GA20ox3, CYP707A4 and KAO2 genes presented increased expression with dormancy release indicating these genes might be positively associated with seed dormancy release.GA2ox participated in the catabolism of biological active GA and ZEP involved in ABA biosynthesis indicating that these two genes might be negatively related with dormancy release [39,40].In this study, the GA2ox3 expression level Day 0 was almost the same as Day90 and Day180 The ZEP expression level also had no significantly difference among the samples.In this work three unigenes were annotated to the dormancy-associated protein gene (DRM) and the mRNA level of the DRM1 gene Day90 and 180(Em) were significantly increased compared with the Day0 (P<0.05).This indicates that DRM activity may be correlated with morphological and physiological dormancy release in American ginseng seeds.GIGANTEA is one of the clock-associated protein that plays an important role in circadian oscillation and flowering-time regulation [41,42].The expression level of GAGANTEA2 was significantly higher in 180 th days' endosperm than other samples (P<0.05).We also found that relative expression levels of GA3ox4, GA20ox3, DRM1, ZEP and KAO2 were basically consistent with the semi-quantitative PCR showing lower levels in Day0 compared with other seed treatment samples.

Conclusion
In this study, we obtained a comprehensive transcriptome of the seed dormancy release process of American ginseng using RNA-Seq technology.We assembled 78,207 unigenes and assigned 54,292 of them an explicit annotation.Many genes predicted to be involved in seed development and dormancy release were identified in this transcriptome.These findings substantially supplement existing sequence resources for American ginseng.Additionally, differentially expressed genes in the three stages of the seed dormancy release process (90DAS, 135DAS, 180DAS) were identified and functionally annotated with COG and KEGG database.These data will provide potential molecular targets for functional studies of genes in American ginseng seeds responding to development and dormancy release.

Fig 3 .
Fig 3. Gene Ontology (GO) categories of the assembled American ginseng unigenes.The unigenes were assigned to the three GO categories: biological process, cellular component, and molecular function.doi:10.1371/journal.pone.0118558.g003

Table 3 .
Summary of read numbers in the RNA-Seq data from American ginseng seeds. doi:10.1371/journal.pone.0118558.t003

Table ) .
The significantly enriched GO terms were almost all general cell functions, such as non-membrane-bounded organelle, intracellular organelle part, structural molecule activity, and oxidoreductase; however, the DEGs assigned the steroid binding, steroid dehydrogenase activity, hormone binding, and auxin transmembrane transporter activity GO terms may be important candidate genes for seed dormancy release studies, although these terms were not significantly enriched.