The First Illumina-Based De Novo Transcriptome Sequencing and Analysis of Safflower Flowers

Background The safflower, Carthamus tinctorius L., is a worldwide oil crop, and its flowers, which have a high flavonoid content, are an important medicinal resource against cardiovascular disease in traditional medicine. Because the safflower has a large and complex genome, the development of its genomic resources has been delayed. Second-generation Illumina sequencing is now an efficient route for generating an enormous volume of sequences that can represent a large number of genes and their expression levels. Methodology/Principal Findings To investigate the genes and pathways that might control flavonoids and other secondary metabolites in the safflower, we used Illumina sequencing to perform a de novo assembly of the safflower tubular flower tissue transcriptome. We obtained a total of 4.69 Gb in clean nucleotides comprising 52,119,104 clean sequencing reads, 195,320 contigs, and 120,778 unigenes. Based on similarity searches with known proteins, we annotated 70,342 of the unigenes (about 58% of the identified unigenes) with cut-off E-values of 10−5. In total, 21,943 of the safflower unigenes were found to have COG classifications, and BLAST2GO assigned 26,332 of the unigenes to 1,754 GO term annotations. In addition, we assigned 30,203 of the unigenes to 121 KEGG pathways. When we focused on genes identified as contributing to flavonoid biosynthesis and the biosynthesis of unsaturated fatty acids, which are important pathways that control flower and seed quality, respectively, we found that these genes were fairly well conserved in the safflower genome compared to those of other plants. Conclusions/Significance Our study provides abundant genomic data for Carthamus tinctorius L. and offers comprehensive sequence resources for studying the safflower. We believe that these transcriptome datasets will serve as an important public information platform to accelerate studies of the safflower genome, and may help us define the mechanisms of flower tissue-specific and secondary metabolism in this non-model plant.


Introduction
The safflower, Carthamus tinctorius L., is a member of the Compositae or Asteraceae plant family, and is cultivated mainly for its seeds and flowers. The safflower plant yields a variety of products, including the numerous secondary metabolites of its flowers and the unsaturated fatty acids in its seeds, many of which are very beneficial to human health. Although the safflower is a minor crop, it is one of the oldest crops in human history, and it is cultivated worldwide; in 2007, safflower cultivation accounted for , 795,118 ha across the globe [1]. Traditionally, this crop was grown for its flowers, which were a source of yellow and red dyes for clothing and food. Safflower oil first was first used in the paint industry, but due to its health benefits, it is now produced commercially for cooking oil and other uses [2,3,4,5].
In plants, primary metabolites are organic compounds that are directly involved in normal growth, development and reproduction. In contrast, secondary metabolites often play important roles in defending the plant against herbivores and diseases. The flavonoids are a large group of secondary metabolites that humans have long used for medicines [6,7], toning, and other purposes. Safflower flavonoid extracts have been clinically used for the prevention and treatment of cardiovascular disease in China [6,8] and Japan for centuries. The safflower is a native plant whose flowers have long been processed into a Chinese herbal medicine called Chuanhonghua in China Jianyang country. In China, safflower is mainly cultivated for the medicinal properties of its flower tissues. To date, , 50 secondary metabolites have been identified in safflower flower extracts; most of them are generalized flavonoids, a class of modulators that mediate bifunctional interactions at vicinal ATP-and steroid-binding sites [9]. These include carthamone, safflor yellow A, hydroxysafflor yellow A, 6hydroxykaempferol, 6-hydroxykaempferol-3-o-b-D-glucoside, kaempferol [10,11,12], and others. The abundance of flavonoids as secondary metabolites makes the safflower a good model for investigating flavonoid biosynthesis in plants, and the related genes and pathways. This should provide a better understanding of the genes responsible for regulating natural products, and could help us improve the use of these natural products for human health.
In recent years, researchers have focused on examining the genetic diversity of the safflower. For example, Pahlavani et al. reported on the inheritance of flower color and spininess [13], while Chapman et al. reported on the DNA sequence diversity and origins of the cultivated safflower, along with its development and polymorphisms, and the cross-taxon utility of EST-SSR markers [14,15,16]. More recently, Li et al. found that there are at least 236 known microRNAs (miRNAs) expressed in the safflower, 100 of which are conserved across plants [17]. To date, however, the safflower genome has not been fully sequenced, and relatively little is known about its encoded genes. As of October 2011, only 567 nucleotide sequences, 41,588 expressed sequence tags (ESTs), 162 proteins and 0 genes from Carthamus tinctorius had been deposited in the NCBI's GenBank database.
EST sequencing has traditionally been the core technology used for the discovery of reference transcripts. However, it has some inherent limitations, such as low throughput, high cost and a long experimental cycle. In recent years, researchers have developed a high-throughput sequencing technology called Next Generation Sequencing (NGS) [18]. Various platforms utilize NGS, such as the Illumina Genome Analyzer, the Roche/454 Genome Sequencer FLX Instrument, and the ABI SOLiD System; these have proven to be powerful and cost-effective tools for advanced research in many areas, including genome sequencing, genome resequencing, miRNA expression profiling, DNA methylation analysis, and especially the de novo transcriptome sequencing of non-model organisms [19,20]. This method of transcriptome analysis is fast and simple because it does not require bacterial cloning of the cDNAs. Instead, direct cDNA sequencing generates an extraordinary depth of short reads. It is a more comprehensive and efficient way to measure transcriptome composition, obtain RNA expression patterns, and discover new genes. In addition, this approach is very sensitive, and thus allows the detection of low-abundance transcripts. Recent transcriptomic studies on Arabidopsis thaliana [21], mouse [22], and human [23] cells have demonstrated that this approach is well-suited for surveying the complexity of eukaryotic transcriptomes by de novo assembly.
In this study, we used NGS technology to survey the poly (A) + transcriptome of Carthamus tinctorius L. flower tissues. The coverage of the transcriptome was , 4.7 Gb of clean nucleotides, and thus comprehensive enough to discover all known genes and major metabolic pathways in this transcriptome. This dataset will serve as a public information platform for gene expression, genomics, and functional genomics in Carthamus tinctorius L. The assembled and annotated transcriptome sequences provide a valuable resource for identifying most of the genes expressed in the safflower.

Flower Transcriptome Sequencing Output and de novo Assembly
To comprehensively cover the safflower flower transcriptome, total RNA was extracted from the tubular tissues of red/orange and white flowers. The phenotypes of each color are shown in Figure 1A. Flowers were collected on the 1st, 2nd and 3rd days of the flowering stage. Equal amounts of total RNA from each sample were pooled together, and mRNA was isolated, enriched, sheared into smaller fragments, and reverse-transcribed into cDNA. The cDNA were subjected to Illumina HiSeq TM 2000 sequencing, and the resulting sequencing data were subjected to bioinformatic analysis ( Figure 1B-D). After removal of adaptor sequences, ambiguous reads and low-quality reads (Q20,20), we obtained a total of 52,119,104 clean reads of 75-bp long comprising 4,690,719,360 nucleotides (4.69 Gb) of high-quality clean reads. An overview of the sequencing and assembly is outlined in Table 1. All high-quality reads were assembled de novo using the Trinity program [24], which produced 195,230 contigs, with an N50 of 290 bp (i.e. 50% of the assembled bases were incorporated into contigs of 290 bp or longer).
The size distribution of the safflower contigs is shown in Figure 2A. A total of 195,230 contigs were assembled; the mean length was 255 bp, and although the majority of the contigs were less than 200 bp, 41,682 of the contigs were greater than 500 bp in length. A total of 120,778 unigenes were assembled, with an average unigene length of 446 bp and an N50 of 555 bp (the size distribution is shown in Figure 2B). Of the 120,778 unigenes, 49,376 were longer than 500 bp; 10,543 were longer than 1,000 bp; and 1,600 were longer than 1,500 bp. Although the unigene distribution closely followed the contig distribution, with the majority being shorter sequences, we obtained much longer unigenes compared to contigs (Figure 2A and 2B). The longest unigene recovered in this work was Unigene1325_safflower, which is 3,967 bp and encodes a splicing factor 3B subunit 1 (according to SwissProt annotation).
For our analyses of coding sequences (CDS), predicted proteins, and gene annotations, all assembled sequences were first searched against the plant proteins found in the NCBI non-redundant database (Nr), and then against the Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes pathway (KEGG) and Orthologous Groups of proteins (COG) [25] databases, using the BLASTX program (E-value ,10 25 ). We obtained a total of 70,342 significant BLAST hits (58.24% of all unigenes). The size distribution for the CDS and predicted proteins are shown in Figure 2C and 2D. The CDS of the unigenes that did not have BLAST hits were predicted using ESTScan and then translated into peptide sequences [26]; 6,623 unigenes were analyzed using this method (the size distributions of the ESTs and proteins are shown in Figure 2E and 2F).

Functional Annotation
Unigene annotations provide functional information, including protein sequence similarities, COG clusters, gene ontology (GO) and KEGG pathway information. We searched the safflower unigene sequences against the protein databases (Nr, SwissProt, KEGG and COG) using BLASTX (E-value ,10 25 ) and predicted the protein functions from the annotations of the most similar proteins. Distinct gene sequences were first searched using BLASTX against the Nr database, and 67,980 unigenes (56.3% of all unigenes) had hits that exceeded the E-value cutoff. Similarly, 49,260 unigenes (40.8%) were identified from the SwissProt database. A total of 70,342 unigenes were annotated in one or more of the databases (about 58% of all assembled unigenes), suggesting they have relatively well conserved functions.
GO is an international standardized gene-function classification system that uses a dynamically updated, controlled vocabulary and a strictly defined concept to comprehensively describe the properties of genes and their products in any organism. The GO database comprises three ontologies: molecular function, cellular components and biological processes. The basic units of GO are the ''GO terms,'' which each belong to a type of ontology. Here, we obtained the GO functional annotations of the safflower unigenes with Nr annotations by BLAST2GO program [27], and then used the WEGO software [28] to perform GO functional classifications for all of the unigenes and to examine the macrolevel distribution of gene functions for this species. A total of 26,332 safflower unigenes were assigned 1,754 GO-term annotations using BLAST2GO, and the terms were summarized into the three main GO categories and 44 sub-categories (functional groups) ( Figure 3B). In each of the three main categories of the GO classification (biological processes, cellular components and molecular functions), the dominant terms were ''cellular processes,'' ''metabolic processes,'' ''cells,'' ''cell parts,'' ''organelles,'' ''binding,'' and ''catalytic activity.'' More than half of the genes fell  within the ''biological processes'' category. A high-percentage of the genes in the biological processes category fell under ''cellular processes'' and ''metabolic processes;'' a high percentage of the genes in the cellular components category fell under ''cells,'' ''cell parts,'' and ''organelles''; and ''binding'' and ''catalytic activity'' dominated in the molecular function category ( Figure 3B). Pathway-based analysis can help us further understand the biological functions of genes. The KEGG pathway database contains information on networks of intracellular molecular interactions, and their organism-specific variations [29]. To identify the biological pathways in the safflower, we mapped the annotated sequences to the reference canonical pathways contained in the KEGG database. In total, we assigned 30,203 unigenes to 121 KEGG pathways (File S1). The most highly represented category was ''metabolic pathways,'' with 6,948 members. The ''biosynthesis of secondary metabolites'' and ''plant-pathogen interaction'' pathways were also well represented, with 3,598 members and 2,348 members, respectively. Many genes corresponded to pathways involved in the biosynthesis of secondary metabolites, including ''flavonoid biosynthesis'' (274 genes), ''flavone and flavonol biosynthesis'' (114 genes), ''isoquinoline alkaloid biosynthesis'' (77 genes), ''tropane, piperidine and pyridine alkaloid biosynthesis'' (72 genes), ''glucosinolate biosynthesis'' (60 genes), ''synthesis and degradation of ketone bodies'' (35 genes), ''indole alkaloid biosynthesis'' (22 genes), ''anthocyanin biosynthesis'' (20 genes), and ''betalain biosynthesis'' (4 genes). Plants often produce secondary metabolites for pathogen defense, so it is unsurprising that 2,348 unigenes were mapped to pathways involved in plant-pathogen interactions, such as ''phagosomes'' (422) and ''natural killer cell-mediated cytotoxicity'' (126). In addition, many genes were related to ''fatty acid biosynthesis and metabolism''; these pathways included ''alpha-linolenic acid metabolism'' (268), ''biosynthesis of unsaturated fatty acids'' (216), ''fatty acid metabolism'' (204), ''fatty acid biosynthesis'' (143), ''linoleic acid metabolism'' (105) and ''arachidonic acid metabolism'' (48). Finally, some genes also distributed to the plant hormone biosynthesis pathways, including ''zeatin biosynthesis'' (361) and ''brassinosteroid biosynthesis'' (28).

Analysis of Flavonoid Biosynthesis-related Pathways and Genes in the Safflower
The flavonoids include flavanone, flavones, flavonols, quinochalcone and their derivatives. Because the pharmacological effects of safflower flavonoids, we focused on the related genes in this study. In the safflower, quinochalcone is the product of a chalcone, which is oxidized in the A ring and then substituted by glucoses in one or more positions to achieve the final structure. In our annotated safflower flower tissue transcriptome datasets, we identified multiple unigenes encoding enzymes involved in known flavonoid biosynthesis pathways. Their sequencing depth was more than 10-fold coverages (Figure 4, File S2).
The main pathway of flavonoid biosynthesis is well understood, and it is known to be fairly well conserved among plants [30,31,32]. At the beginning of this process, one molecule of 4coumaroyl-CoA and three molecules of Malonyl-CoA are catalyzed by the polyketide synthase, chalcone synthase (CHS, EC 2.3.1.74, in KEGG database), into 4,29,49,69-tetrahydroxychalcone. CHS is the first committed enzyme in this pathway. Five safflower unigenes and two hypothetical unigenes were identified as CHIs; four of them were more than 1000 bp in length. The intermediate, 4,29,49,69-tetrahydroxychalcone, is key for the biosynthesis of both quinochalcones and flavonoids; it may be hydroxylated and glycosylated, leading to the production of carthamone (in red flowers) and hydroxysafflor yellow A. Hydroxysafflor yellow A, which was first identified in 1993 by Meselhy et al. [33], is the main component of the red/orange color of their flowers, and it is the most widely studied chemical in the safflower. Previous studies have shown that hydroxysafflor yellow A has the following properties: it can enhance the survival of vascular endothelial cells under hypoxia [34]; it can protect against LPS-induced acute lung injury, which is thought to be associated with the inhibition of p38 MAPK, the activation of NF-kappaB p65, and changes in inflammatory cytokine expression [35]; it  might be a promising antifibrotic agent in chronic liver disease [36]; it can significantly inhibit the neuronal damage induced by exposure to glutamate and sodium cyanide (NaCN) in cultured fetal cortical cells [37]; it can protect against mitochondrial injuries in rat cortexes induced by cerebral ischemia [38]; and it can protect human umbilical vein endothelial cells from hypoxiainduced injuries by inhibiting apoptosis and cell cycle arrest [39]. As a substrate, hydroxysafflor yellow A can be catalyzed to carthamin and safflor yellow A. One study showed that injection of safflor yellow A is a safe and effective treatment for coronary heart disease angina pectoris (CHD-AP) with Xin-blood stagnation syndrome (XBSS) [40]. The enzymes responsible for catalyzing 4,29,49,69-tetrahydroxychalcone to quinochalcones are not well understood at this time, so they were not assessed in the present study.
Kaempferol can also be converted to quercetin by flavonoid 39hydroxylase (F39H, also called flavonoid 39-monooxygenase; EC 1.14.13.21) or flavonoid 39, 59-hydroxylase (F3959H; EC 1.14.13.88) by adding a hydroxyl to the 39 position of ring B. F39H was found to be the largest family of safflower flavonoid biosynthesis enzymes, comprising 39 members ranging from 165 bp to 1,456 bp. In contrast, only three F3959H unigenes were identified. In the next step of this pathway, 3GT can catalyze quercetin to quercetin-3-O-b-D-glucoside. As an optional pathway, dihydrokaempferol can be converted to dihydroquercetin by F3959H or F39H, and then dihydroquercetin can be deoxidized to leucoanthocyanidin by dihydroflavonol-4-reductase (DFR; EC 1.1.1.219). Three DFR unigenes were identified in the safflower flower. Finally, leucoanthocyanidin can be catalyzed to catechin by leucoanthocyanidin reductase (LAR; EC 1.17.1.3). The LAR family was the second largest family identified in this pathway; 10 members were found in the safflower, ranging from 202 bp to 1,382 bp.
Although we searched for flavone synthase (FNS) genes in the transcriptome sequencing data of the safflower, we did not identify any such genes in this study. We also failed to find any anthocyanidine-encoding genes; however, we did identify four anthocyanidin synthase (ANS, also called leucoanthocyanidin dioxygenase)-encoding genes, suggesting that the genes encoding other anthocyanidine-related products may exist but have not yet been identified in the flower.

qRT-PCR Analysis of the Flavonoid Synthesis Related Genes in the Safflower
To confirm that the unigenes from sequencing and computational analysis were indeed expressed and also to analyze the difference of gene expression profile between red and white color flowers, twelve unigenes related to the ten families of safflower flavonoid synthesis were chosen for qRT-PCR analysis. In the qRT-PCR analysis, to get more information, we used four flower materials including of flowering day2 and day4 of both white and red flowers for further compare instead of day1 and day3 of flowering ( Figure 5, the ratios were resulted from the comparing analysis of each gene to 28S, unigene82078). In general, the red flower showed much higher gene expression profile than the white flower in most of the selected genes. Among them, the CHS gene (unigene96945) showed very high expression in the red flowers with the ratios of 174.16 and 84.5 on flowering day2 and day4 individually; comparing to the ratios in the red flower, this gene showed much lower expression in the white flowers with ratios of 9.85 and 11.08 on flowering day2 and day4 separately. This result indicated that CHS gene (unigene96945) is very important for safflower flavonoid synthesis. The similar expression profile can be observed when considering the two detected CHI genes (unigene16741 and unigene27082); they showed much higher expression in the red flower than in the white flower, especially the unigene27082 with ratios of 32.0 and 41.64. For the two F3H genes (unigene45665 and unigene4561), they showed a relatively medium expression in the two color flowers. The F39H gene (unigene104057) also showed very high expression on flowering day2 of the red flower with ratio of 95.01, suggesting that at the early stage of flowering (day2), hydroxylation is an important modification for the red flower flavonoid synthesis. Similar expression profiles can also be observed in the genes of FLS (unigene29190), DFR (unigene98353) and 3GT (uni-gene71390). On the contrary, the F3959H gene (unigene2547) showed higher expression in the later stage of flowering (day4) than in the early stage of flowering (day2) in both white and red flowers. Among all of the twelve genes, only the LAR gene (unigene25256) showed much higher expression in the white flowers with ratios of 13.09 and 8.17 on day2 and day4 individually comparing to the ratios of 0.30 and 0.22 on day2 and day4 in the red flowers, indicating that as a substrate much more leucoanthocyandin is produced in the white flower than in the red flower.

Discussion
High-throughput mRNA sequencing technology is especially suitable for gene expression profiling in non-model organisms that lack genomic sequence data. Prior to this study, most sequencing efforts in the safflower were based on EST sequencing; very few tags had been reported in public databases and little genetic or genomic information was available. Here, we used RNA-Seq technology to profile the safflower flower transcriptome on the Illumina HiSeq TM 2000 platform, obtained 4.6 Gb of coverage with 52,119,104 clean sequencing reads, and identified a total of 120,778 unigenes from de novo assembly. Among them, 70,342 unigenes were successfully annotated (about 58% of the assembled unigenes), suggesting their relatively conserved functions. We assessed not only the gene or protein names and descriptions, but also their putative conserved domains, gene ontology terms, and potential metabolic pathways. This work will certainly improve our understanding of the processes involved in regulating the biosynthesis of secondary metabolites and the development of safflower flower tissues. Compared with previous transcriptomic studies in other plants, such as Acacia auriculiformis, Acacia mangium [41], Eucalyptus [42], and Taxus [43], we herein report more contigs and unigenes, suggesting that the safflower contains very abundant gene resources. To the best of our knowledge, this is the first attempt using Illumina paired-end sequencing technology for the de novo sequencing and assembly (without a reference genome) of the safflower flower transcriptome. We believe our data will provide important new insights and facilitate further studies of safflower genes and their functions.
In terms of medicinal and pharmaceutical use, the properties of the safflower flower largely depend on its flavonoid profile. Since gaining new insight into the biosynthesis and transcriptional regulation of flavonoids in the safflower should accelerate the engineering of this pathway in the future, we focused on the flavonoid biosynthesis genes identified in the present work. Almost all of the genes for the main flavonoid biosynthesis pathway were identified (except for the flavone synthesis gene), indicating that this pathway was rather well conserved in the safflower (Figure 4 and File S3). Many of these genes appeared to form multi-gene families, implying that the genome of the safflower, like those of many other higher plants, went through one or more rounds of genome duplication during its evolution. However, the actual flavonoid pigments of plants are highly diverse in structure and color, reflecting the diversity of plants and their metabolic pathways. Indeed, the safflower is likely to possess some unique, yet-unidentified flavonoid-related molecules. Therefore, we further surveyed other genes that might be involved in the various branch or terminal pathways of flavonoid biosynthesis. KEGG predictions allowed us to identify an additional 227 genes that also contributed to generalized flavonoid biosynthesis, including members of the CYP family (electron carrier/heme binding/iron ion binding/ monooxygenase/oxygen binding), the O-methyltransferase, Ohydroxycinnamoyltransferase family, and the glycosyltransferase family, which are involved in uniquely modifying the flavonoid skeleton to yield the final products (the other related genes are given in File S3).
The safflower is a multipurpose crop, but its main use worldwide is as an oil crop. Its main seed oil components are linoleic acid (71-75%), oleic acid (16-20%), palmitic acid (6-8%) and stearic acid (2-3%) [1], most of which are unsaturated fatty acids. Safflower petals also contain alpha linolenic acid (15-19%), palmitic acids (14-16%), gamma linolenic acid (2-3%), and decanoic and dodecanoic acids (2-5%) [44]. Safflower linoleic acid may be beneficial for weight loss and/or glycemic control for type-2 diabetes [45], while dietary supplementation with safflower oil and olive oil was shown to rescue aberrant embryonic arachidonic acid and nitric oxide metabolism and prevent embryopathy in diabetic rats [46]. In our sequencing of transcripts from 1-3d flowers, we identified 149 genes that distributed to the ''biosynthesis of unsaturated fatty acids'' category (File S4; sequencing depth .10). These were mainly fatty acid desaturases, including stearoyl-ACP desaturase, which was cloned in 1992 by Knutzon et al. [47], along with FAD2, FAD3, FAD5A, FAD6, FAD7 and FAD8 [48,49,50]. The expression of these gene products during this flowering stage suggests that the unsaturated fatty acid biosynthesis pathway is rather well conserved in the safflower with respect to Arabidopsis, Rape and Cabbage. Although we only identified one oil biosynthesis gene in safflower flowers, namely that encoding stearoyl-ACP desaturase, our data may facilitate the future study of other related genes.
In conclusion, this sequence collection represents the first major genomic resource for the safflower, Carthamus tinctorius L. Using Illumina sequencing technology, we surveyed the flower transcriptome of the safflower, assembled 120,778 unigenes and annotated 70,342 of these unigenes. These findings provided comprehensive coverage, suggesting that we identified almost of the all known genes from the major metabolic pathways operating in the safflower flower. Our KEGG prediction results suggest that the flavonoid biosynthesis and unsaturated fatty acid biosynthesis pathway-related genes are rather well conserved in the safflower. Overall, we identified more than 200 pathways as being represented in the safflower whole-flower transcriptome. Further analysis of these pathway-related genes will improve our understanding of the safflower's unique and common features. This study again demonstrates that Illumina sequencing technology may be applied as a rapid and cost-effective method for de novo transcriptome analysis of non-model plants for which genomic information is unavailable. We believe that this transcriptome dataset will serve as an important public information platform to accelerate research on the gene expression, genomics, and functional genomics of Carthamus tinctorius L.

Tissue Collection and RNA Isolation
Safflower plants were cultivated in the traditional Chinese medicine cultivation zone of the town of Shipan (Jianyang county Sichuan Province, China; 30.24?N 104.32?E). This zone was built by our institute (Institute of Economic Crops Breeding and Cultivation, Sichuan Academy of Agricultural Sciences) for the collection and cultivation of Chinese herbal medicine resources. The safflower is not on the list of endangered/protected plants, and no specific permission was required for us to use this material. The safflower is generally planted in late October, and then flowers during the first two weeks of the following May. Tubular flower tissues of 1-day-old (1d), 2d and 3d were cultivated from red/orange and white Carthamus tinctorius L. The white flower is a normally occurring mutant of the red/orange flower ( Figure 1A). Total RNA was isolated from each sample using TRIzol (Invitrogen) according to the manufacturer's instructions. Total RNA was treated with RNase-free DNase I (New England BioLabs) for 30 min at 37uC to remove residual DNA. Equal amounts of RNA from each sample were mixed for the subsequent steps of our experiments. The workflow is shown in Figure 1B.
Preparation and Sequencing (mRNA-seq) of the cDNA Library The workflow described in this section is shown in Figure 1B. In brief, RNA was collected from flower tissues, and oligo(dT) beads were used to isolate poly(A) mRNA. Fragmentation buffer was used to chop the mRNA into short fragments, which were then used as templates for random hexamer-primed synthesis of firststrand cDNA. Second-strand cDNA was synthesized using buffer, dNTPs, RNase H, and DNA polymerase I. From these cDNA, a paired-end library was synthesized using the Genomic Sample Prep kit (Illumina), according to the manufacturer's instructions. Short fragments were purified with the QIAquick PCR (Qiagen) extraction kit and then resolved with EB buffer for end repair and the addition of poly (A). The short fragments were then connected with sequencing adapters, and suitable fragments were separated by agarose gel electrophoresis. Finally, the sequencing library was built by PCR amplification and sequenced using the HiSeq TM 2000 platform (Illumina). The transcriptome datasets are available at the NCBI Sequence Read Archive (SRA), under accession number SRA048496.

Analysis of Illumina Transcriptome Sequencing Results
A simplified workflow of the transcriptome assembly and bioinformatic analysis is shown in Figure 1C and 1D, respectively. The raw sequencing data were transformed by base calling into sequence data (called raw data or raw reads), which were stored in fastq format. Adaptor fragments were removed from the raw reads to yield the clean reads required for analysis. De novo transcriptome assembly of these short reads was performed using the SOAPdenovo assembling program, which first combined reads with a certain length of overlap to form longer fragments without Ncalled contigs. The reads were mapped back to contigs, and paired-end reads were used to detect contigs arising from the same transcript as well as the distances between these contigs. Next, SOAPdenovo connected the contigs, using ''N'' to represent unknown sequences, yielding scaffolds. Paired-end reads were then used again to fill in the gaps between the scaffolds, yielding sequences that had the fewest Ns and could not be extended on either end. These were designated as unigenes.
For further analysis, we first used BLASTX (E-value ,10 25 ) to search the unigene sequences against various protein databases, in the following order: Nr, SwissProt, KEGG and COG. Unigene sequences having hits in one database were not used to search the subsequent database(s). The BLAST results were then used to extract CDS from the unigene sequences, and translate them into peptide sequences. The BLAST results were also used to train ESTScan. The CDS of unigenes that had no BLAST hits were predicted by ESTScan and then translated into peptide sequences. Unigene annotation provided information on the expression patterns and functional annotations of the identified genes. For Nr annotation, we used the BLAST2GO program to obtain the GO annotations of the unigenes, and then used the WEGO software to perform GO functional classifications for all the unigenes and to explore the macro-distribution of gene functions for this species.

Analysis of Metabolic Pathway Genes Identified from Carthamus Tinctorius L. Flowers
The metabolic pathway analysis was done using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and related software applications (http://www.genome.jp/kegg/ kegg4.html), which form a bioinformatics resource for linking genomes to living organisms [51]. Within the KEGG databases, the PATHWAY database contains information on networks of molecular interactions within cells, as well as the variations specific to particular organisms. The genes involved in flavonoid and unsaturated fatty acid biosynthesis were identified using BLAST annotation of KEGG and the other databases mentioned above. To increase the credibility of our results, we selected genes that had coverages .10.

Gene Validation and Expression Analysis
Twelve selected unigenes with potential roles in flavonoid synthesis were chosen for validation using real time qPCR with gene specific primers designed with primer premier 3.0 (the sequences of the selected genes was shown in File S4 and the primers designed for qRT-PCR analysis was shown in File S6). Total RNA was extracted from day2 and day4 of red and white flowers individiually using trizol RNA extraction method (Invitrogen, USA) and purified with RNA purification kit (Qiagen, USA). The purified RNA was reverse transcript to cDNA using PrimeScritH RT reagent kit with gDNA Eraser (Perfect Real Time) (Takara, China). 14 genes were first chose for RT-PCR analysis and 12 of them were further chose for qRT-PCR. All the genes used for qRT-PCR were first performed RT-PCR analysis and run gel to get a good condition for qRT-PCR analysis. The qRT-PCR reaction was performed according to the protocol of Green Mastermix (Qiagen, USA) using ABI 7500 system. Three biology duplications of each sample (each sample containing flowers from 4 individual plants) and triplicates of each reaction were performed. 28S gene was chosen as an internal control for normalization after comparison the expressions of two reference genes (15S, unigene104038 and 28S) in different materials.

Supporting Information
File S1 Summary of the involved KEGG pathways and their corresponding gene numbers.