Analysis of the Transcriptome of Erigeron breviscapus Uncovers Putative Scutellarin and Chlorogenic Acids Biosynthetic Genes and Genetic Markers

Background Erigeron breviscapus (Vant.) Hand-Mazz. is a famous medicinal plant. Scutellarin and chlorogenic acids are the primary active components in this herb. However, the mechanisms of biosynthesis and regulation for scutellarin and chlorogenic acids in E. breviscapus are considerably unknown. In addition, genomic information of this herb is also unavailable. Principal Findings Using Illumina sequencing on GAIIx platform, a total of 64,605,972 raw sequencing reads were generated and assembled into 73,092 non-redundant unigenes. Among them, 44,855 unigenes (61.37%) were annotated in the public databases Nr, Swiss-Prot, KEGG, and COG. The transcripts encoding the known enzymes involved in flavonoids and in chlorogenic acids biosynthesis were discovered in the Illumina dataset. Three candidate cytochrome P450 genes were discovered which might encode flavone 6-hydroase converting apigenin to scutellarein. Furthermore, 4 unigenes encoding the homologues of maize P1 (R2R3-MYB transcription factors) were defined, which might regulate the biosynthesis of scutellarin. Additionally, a total of 11,077 simple sequence repeat (SSR) were identified from 9,255 unigenes. Of SSRs, tri-nucleotide motifs were the most abundant motif. Thirty-six primer pairs for SSRs were randomly selected for validation of the amplification and polymorphism. The result revealed that 34 (94.40%) primer pairs were successfully amplified and 19 (52.78%) primer pairs exhibited polymorphisms. Conclusion Using next generation sequencing (NGS) technology, this study firstly provides abundant genomic data for E. breviscapus. The candidate genes involved in the biosynthesis and transcriptional regulation of scutellarin and chlorogenic acids were obtained in this study. Additionally, a plenty of genetic makers were generated by identification of SSRs, which is a powerful tool for molecular breeding and genetics applications in this herb.

The aglycon of scutellarin, scutellarein, is a kind of flavone (5,6,7,49-tetrahydroxyflavone). In plants, flavones are synthesized at a branch point of the anthocyanidin/proanthocyanidin pathway; however, flavanones are known as the direct precursor of flavones [8]. The biosynthesis of flavone and other flavonoids is well characterized [9], some genes encoding the enzymes involved in the biosynthesis of flavonoids have also have been cloned and identified from E. breviscapus, such as chalcone isomerase (CHI), chalcone synthase (CHS) and flavone synthase II (FSII) gene [10,11,12]. The 6-OH is the most important characteristic of scutellarin, suggested that there must be a flavonoid 6-hydroxylase (F6H) in E. breviscapus, which converts apigenin to scutellarein. Some F6Hs have been characterized for their activities of hydroxylation the 6-C in the A-ring of flavonoids [13,14,15], but F6H in E. breviscapus is still unclear.
E. breviscapus belongs to the family of the Asteraceae and is distributed in the southeastern region of China, mainly in Yunnan, Guizhou, Sichun, and Guangxi Provinces [32]. This herb is commonly used for treating various paralysis and its sequelae [16], and also for relieving exterior syndrome, dispelling the wind and dampness, and removing stagnancy of indigested food in folklore [33]. However, the wild resource of this herb is endangered due to overexploitation [34]. Our research group had accomplished the domestication and cultivation of E. breviscapus. Moreover, based on mass selection, we had bred two varieties with high yields and high quality. The content of scutellarin in the two varieties is more than 3.0% [35], which is at least 4 times higher than that in the wild plants [36]. Unfortunately, available genomic information on this herb is unavailable, especially the biosynthesis and regulation of scutellarin and chlorogenic acids is considerably unknown.
The objective of the present study is to analyze the transcriptome of E. breviscapus using Illumina sequencing on GAIIx platform, and to discover all candidate genes encoding enzymes and putative transcription factors involved in scutellarin and chlorogenic acids biosynthesis. Based on RNA-seq, lots of simple sequence repeats (SSR) markers could be found, this will facilitate marker-assisted breeding of this plant.

Ethics statement
No specific permits were required for the described field studies. No specific permissions were required for these locations and activities. The location is not privately-owned or protected in any way and the field studies did not involve endangered or protected species.
Plant material E. breviscapus cv. ''Qianshan 1'' was grown at the experimental fields of Honghe Qianshan Technology Co., Ltd in Lu-xi County, Yunnan province, southwest of China (24u 369 360N, 103u 499 300E, alt. 1710 m), which is a variety with high scutellarin content and selected from wild germplasm in Lu-xi County [35]. The mature leaves ( Figure S1A) and flowers at full-blooming stage ( Figure S1B) were harvested from one-year-old plants. All of the samples were immediately frozen in liquid nitrogen and stored at 280uC until use.

RNA-Seq library construction and sequencing
Total RNA was extracted from young fresh leaves and different stage of flowers using RNeasy Plant Mini Kit (Qiagen) following by the RNA purification by RNeasy MiniElute Cleanup Kit (Qiagen), according to the manufacture's protocol. Equal amounts of RNA from each sample were mixed together for the subsequent steps of our experiments. For mRNA library construction and deep sequencing, at least 20 mg of total RNA samples were prepared by using the TruSeq RNA Sample Preparation Kit (Illumina) for Illumina sequencing on Genome Analyzer IIx platform at CapitalBio Corporation (Beijing, China). The high quality reads obtained in this study have been deposited in the NCBI SRA database (accession number: SRA111764).

Illumina reads processing and assembly
Transcriptome de novo assembly was carried out using the short read assembly program: Trinity. A Perl program was written to remove low quality reads (reads with a base quality less than 20). Then the high quality reads were de novo assembled by using Trinity program [37,38] at k-mer lengths of 25. Functional annotation and predicted CDS Functional annotations were performed by sequence comparison with public databases. All unigenes were compared with the non-redundant protein database (Nr, http://www.ncbi.nlm.nih. gov/) and the SWISS-PROT database (http://www.expasy.ch/ sprot) using BLSATX (E-value ,1e 25 ) [39] and BLASTX (Evalue ,1e 210 ). Unigenes were also compared with the Clusters of Orthologous Groups of proteins database (COG) (http://www. ncbi.nlm.nih.gov/COG/) [40] and Kyoto Encyclopedia of Genes and Genomes database (KEGG, release 58; http://www.genome. jp/kegg) [41] using BLASTX with an E-value of less than 1e 210 , and then a Perl script was used to retrieve KO (KEGG Orthology) information from BLASTX result and then established pathway associations between unigenes and database. InterPro domains [42] were annotated by InterProScan [43] Release 27.0 and functional assignments were mapped onto Gene Ontology (GO) (http://www.geneontology.org/) [44]. Then we made GO classification and drew GO tree using WEGO (http://wego.genomics. org.cn/cgibin/wego/index.pl) [45].
We predicted the CDS (Coding sequences) using blastx and ESTscan. we first performed BLASTX alignment (E-value,10 25 ) between unigenes and protein databases like Nr, Swiss-Prot, KEGG and COG. The best alignment results were used to determine the sequence direction of unigenes. Unigenes with sequences having matches in one database were not searched further. When a unigene would not align to any database, ESTScan was used to predict coding regions and determine sequence direction.

EST-SSR detection and primer design
Using MISA tool [46] (http://pgrc.ipk-gatersleben.de/misa/), the potential SSR markers with motifs ranging from di-to hexanucleotides in size were detected among the 73,092 unigenes. The minimum of repeat units were set as follows: six for dinucleotides and five for tri-, tetra-, penta-and hexa-nucelotides. Primer pairs flanking each SSR loci were designed using the Primer3 program (http://primer3.ut.ee/).

Survey of EST-SSR polymorphism
A total of 36 primer pairs (File S1) were synthesized and thirteen E. breviscapus accessions (File S2) were selected for polymorphism investigation with the EST-SSRs. Total DNA was isolated from E. breviscapus leaves using the CTAB method. PCR amplifications were conducted in a final volume of 20 mL containing 1 mL 2.5 mM dNTPs,1 mL EasyTaq DNA polymerase (Beijing Trans-Gen Biotech Co., Ltd. China), 2 mL 106EasyTaq buffer, 1 mL of each primer (10 mM), 13 mLddH 2 O and 1 mL template DNA  (approx. 10 ng/mL). PCR was performed as follows: denaturation at 94uC for 2 min, followed by 35 cycles of 30 s at 94uC, 30 s at Tm (annealing temperature), 30 s at 72uC and a final step at 72uC for 5 min. The separation of alleles was performed on a 8% polyacrylamide gel. PCR products were mixed with an equal volume of loading buffer. The mixture was denatured at 95uC for 5 min before loading onto the gel.

Genetic diversity and data analysis
The presence of each single band was coded as 1 and its absence as 0 in a data matrix. The genetic diversity and mean allele number were calculated using Popgene version 1.32 [47]. Scoring data from polymorphic loci were used to calculate the polymorphism information content (PIC) according to the formula of PIC = 1-gpi 2 (pi is the frequency of i th allele for each locus) [48].
By NTSYS pc 2.1 program [49], Jaccard's genetic similarity coefficients were calculated and dendrogram was constructed by the UPGMA (un-weighted pair group method with arithmetic mean) clustering method.

Phylogenetic Analysis
Phylogenetic analysis based on the deduced amino acid sequences of CYPs and R2R3-MYB transcription factors from E. breviscapus and other plants. All of the deduced amino acid sequences were aligned with Clustal X using the default parameters: gap opening penalty, 10; gap extension penalty, 0.1; and delay divergent cutoff, 25%, and evolutionary distances were computed using MEGA5.10 with the Poisson correction method. For the phylogenetic analysis, a neighbor-joining tree was constructed using MEGA5.0. Bootstrap values obtained after 1000 replications are indicated on the branches. The scale represents 0.1 amino acid substitutions per site.

Illumina Sequencing and de novo Assembly
In order to find more genes involved in the biosynthesis of scutellarin and chlorogenic acids, total RNA was extracted from the fully-expanded young fresh leaves and the flowers at different stages. Leaves and flowers are the primary medicinal parts of E. breviscapus. RNA samples from different tissues were mixed in equal proportions and then used for mRNA preparation, fragmentation, and cDNA synthesis. The cDNA was sequenced using the Illumina Genome Analyzer IIx platform and the resulting sequencing data were subjected to bioinformatic analysis. The paired-end sequencing yielded 26100 bp independent reads from either end of the cDNA fragment. In this study, 64,605,972 raw reads were generated from a 200 bp insert library. After trimming adaptor sequences and removing the low quality reads (Q20,20), 62,595,942 clean reads were obtained. Among these clean reads, 93.04% reads were deemed high quality reads, and were selected for further analysis.
Reference genome sequence was unavailable for E. breviscapus, thus all the clean reads (62,595,942) were assembled de novo using Trinity, with optimized k-mer length of 25; all short sequences were assembled into 73,092 unigenes within a length range from 201 to 12,242 bp, comprising a total length of 66,962,717 bp. The average length of all unigenes is 916 bp, and there was 23,844 unigenes (32.62%) longer than 1000 bp. An overview of the sequencing and assembly is outlined in Table 1.
The CDS (coding DNA sequence) in all unigenes were also detected by comparing the sequences to protein databases Nr, SwissProt, KEGG, and COG using BLSATX (E-value ,10 25 ). For the unigene that do not have similarity in all protein databases, ESTScan was used for detecting CDS [50]. Most unigenes (44,223, 60.50%) were identified to have CDS, among them, 33.01% unigenes have CDS longer than 100 bp. The length distributions of the contigs, unigenes, and CDS were shown in Figure 2.
The Illumina platform has been widely used for de novo transcriptome sequencing in many organisms. RNA-seq does not require a reference genome to gain useful transcriptomic information. This makes RNA-Seq technology particularly useful for non-model organisms that often lack genomic sequence data [51]. In this study, these results indicated again that Illumina paired-end sequencing technology is useful for the de novo sequencing and assembly of transcriptome of non-model plant species. As far as we know, this is the first report on a large scale of transcriptome sequencing and analysis in E. breviscapus. These data will contribute significantly to further genome-wide research and analyses in this herb.
We also compared the length of unigenes with the percentage which matches against the sequences in Nr and Swiss-Prot database, and found that the longer unigenes had the higher percentage, especially those longer than 1000 bp, but the unigenes of the length of 600 to 1000 bp had a lower percentage of their matches (File S4). For E-value and similarity distributions in the Nr database, 52.63% unigenes showed a significant similarity (E-value,1e 250 ) and 14.28% unigenes had a high similarity of greater than 80%. In the Swiss-Prot database, the percentages were 44.37% and 9.97%, respectively (File S5).

Gene Ontology Classification
GO assignments were used to classify the functions of all unigenes. Based on sequence similarity, 17,156 unigenes were assigned to one or more ontologies. Totally, 31,527 unigenes were grouped under biological processes, 33,986 unigenes under cellular components, 19,671 unigenes under molecular functions. Binding (8,754 unigenes, 44.50%) and catalytic activity (8,606 unigenes, 43.75%) were the most highly represented groups under the molecular function category. For the biological process class, the assignments were mainly given to the metabolic process (7,714 unigenes, 24.47%) and cellular process (7660 unigenes, 24.30%) (Figure 4; File S6). These GO annotations provide a valuable clue for investigating the specific processes, molecular functions, and cellular structures of E. breviscapus transcriptome.

Candidate genes encoding enzymes involved in scutellarin biosynthesis
F7GAT transfers glucuronic acid from UDP-glucuronic acid (UDPGA) of sugar donor to the 7-OH group of flavonoids of sugar acceptor. F7GAT was firstly purified from S. baicalensis, which specifically uses UDPGA as the sugar donor and glucuronosylates the 7-OH group of flavones with ortho-substituents, such as scutellarein and baicalein [53]. F7GATs isolated from the plants within the Lamiales order are the members of the UGT88-related cluster [6]. In this study, we did not found any ortholog of F7GAT in the transcriptome dataset of E. breviscapus. The reason is not clear.

Transcripts encoding putative flavone-specific MYB transcription factors in E. breviscapus
Since scutellarin is a kind of flavone glucuronide derivative, so we focused on flavone-specific transcription factors (TFs), especially on R2R3-MYB. The R2R3-MYB TFs for anthocyanin and proanthocyanin belong to maize C1 homologues (R2R3-MYB), such as AN2 in petunia [54] and PAP1/PAP2 in Arabidopsis [55]. These R2R3-MYB TFs require interaction with basic helix-loophelix (bHLH) and WD40 repeats (WDR) protein for regulating the structural genes expression in the late biosynthetic pathway. In contrast, the R2R3-MYB TFs regulating flavones, flavonols, and phlobaphene biosynthesis don't need to bind with other type of TFs. Such the R2R3-MYB TFs are the homologues of maize P1 (ZmP1), including AtMYB12, AtMYB11, and AtMYB111 in Arabidopsis [56,57], VvMYBF1 in grapevine [58], and GtMYBP3 and GtMYBP4 in Gentiana triflora [59]. In the transcriptome dataset, a total of 846 unigenes were annotated to MYB TFs, among them, 4 unigenes are highly similar to AtMYB12 or AtMYB111 (File S11). Phylogenetic analysis showed that they all belonged to P1/subgroup 7 (Figure 8), suggesting that those might be the specific TFs for particularly regulate the genes involved in scutellarin biosynthesis. Characterizing the functions of those unigenes will help us deeply understand the regulatory mechanism of scutellarin biosynthesis.  Genes encoding enzymes involved in chlorogenic acids biosynthesis Three pathways have been proposed for the biosynthesis of CGA (Figure 1). Except HCGQT (hydroxycinnamoyl D-glucose: quinate hydroxycinnamoyl transferase), nearly all known enzymes involved in CGA biosynthesis had transcripts in this Illumina dataset, including UGCT (UDP glucose: cinnamate glucosyl transferase), C3H (p-coumarate 3-hydroxylase), HCT (hydroxycinnamoyl CoA shikimate/quinate hydroxycinnamoyl transferase), and HQT (hydroxycinnamoyl CoA quinate hydroxycinnamoyl transferase) ( Table 3; File S9). Because of more transcripts encoding enzymes in pathway 2 and 3 (Figure 1), they are proposed to be the main biosynthesis pathways of CGA in E. breviscapus. The inhibiting of the expression of those genes by antisense RNA technology or RNA interference might reduce the contents of CGA and increase the content of scutellarin in this herb. Furthermore, the gene encoding enzyme which catalyzes CQA for the synthesis of di-CQAs is presumed to be the homologues of HQT gene, because they all transfer caffeoyl ( Figure 1). In this study, we found that 8 transcripts were annotated to HQT (Table 3). Functional analysis of the unigenes will help us find the gene and enzyme that catalyzes the final step of di-CQA biosynthesis.

EST-SSR Discovery: Distribution and Frequencies
Among molecular markers, SSRs are the most favored genetic markers because of their relative abundance, and they have been widely applied for molecular-assisted selection (MAS) in plant breeding programs [60]. All the unigenes obtained were searched in order to determine the frequency and distribution of SSRs. The potential SSRs were detected in all of the 73,092 unigenes using MISA software. A total of 11,077 SSRs were identified in 9,255 unigenes. Of all the SSR containing unigenes, 1,431 sequences contained more than one SSR and 848 SSRs were present in compound form ( Table 4). The information of SSR derived from all unigene was shown in File S12. On average, we found 1.38 SSR per 10 Kb in this study, this is similar to the frequency in Apium graveolens (1 SSR per 10 KB) [61], but different from the frequencies in other plants [62,63]. Among 11,077 SSRs identified, the tri-nucleotide repeat motifs were the most abundant types (45.19%), followed by di-(42.66%), tetra-(8.54%), hexa-(2.17%), and penta-nucleotide (1.44%) repeat motifs. SSRs with six tandem repeats (29.83%) were the most common, followed by five tandem repeats (29.05%), seven tandem repeats (15.02%), and four tandem repeats (10.03%) ( Table 5). The most common type of di-nucleotide was AT/AT that accounted for 23.43% of the repeats, followed by AC/GT (11.95%) and AG/CT (7.28%). Among the tri-nucleotide repeats, both of AAT/ATT and ATC/ ATG were the most frequent motifs (11.10%) (File S13). Using Primer3, total 20,865 pairs of primers were designed (File S14). These unique sequence-derived markers obtained in this study represent a valuable genetic resource for SSR mining and applications.

EST-SSR marker polymorphism
In this study, 36 primer pairs were randomly designed to validate the amplification and polymorphism in thirteen E. breviscapus accessions. 34 (94.40%) of the primer pairs successfully amplified fragments and 26 pairs (72.22%) produced PCR amplicons at the expected size. Among the successful primer pairs, 19 pairs (52.78%) exhibited polymorphisms among the thirteen E. breviscapus accessions (File S15). Polymorphism infor- Table 5. Distribution of identified SSRs using the MISA software. The polymorphic SSR markers were then used to perform genetic correlation analysis among the 13 E. breviscapus accessions. The UPGMA clustering produced a dendrogram (Figure 9) that separated the 13 E. breviscapus accessions into three groups at the level of genetic similarity 0.93. All local species (YX-1,YX-2,YX-3,and YX-3) formed a cluster. Within the cultivated species, two lines (QS-3 and QS-4) formed a sub-cluster, and the other two lines (QS-1 and QS-2) did not cluster together with other cultivated species. Almost all wild species (MDY-1, MDY-2 and MDY-3) formed a cluster. It is noteworthy that two wild species (MD-1 and MD-2) formed a cluster with the other two cultivated species (QS-3 and QS-4), thus indicated that they may have a similar genetic background. These results demonstrate that SSRs can distinguish varieties without morphological diversities, and they will be a powerful tool for molecular breeding and genetics applications in this herb.

Conclusions
E. breviscapus has been used as an herb for more than several centuries, but it was cultivated about only ten years. Thus, genomic information is urgently needed for molecular breeding. For a non-model plant which cannot be easily accessed for whole genome sequencing, the analyses of the transcripts provide an efficient and cost-effective way for genome exploring. This is the first attempt to elucidate the transcriptome of E. breviscapus using Illumina next-generation sequencing and de novo assembly. This study provided a large number of transcripts for gene function analysis and new gene discovery in this herb, especially for the candidate genes of flavone 6-hydroase and the transcription factors which might regulate scutellarin biosynthesis. The genes identified in this study will help to decipher the molecular mechanisms of secondary metabolism. This study will also contribute to the improvements on this species through marker-assisted breeding or genetic engineering.  File S11 MYB transcription factors in E. breviscapus unigenes.

(XLSX)
File S12 Information of SSR derived from all unigene.

(XLSX)
File S13 Frequency distribution of SSRs based on motif types.