De Novo Transcriptome of Brassica juncea Seed Coat and Identification of Genes for the Biosynthesis of Flavonoids

Brassica juncea, a worldwide cultivated crop plant, produces seeds of different colors. Seed pigmentation is due to the deposition in endothelial cells of proanthocyanidins (PAs), end products from a branch of flavonoid biosynthetic pathway. To elucidate the gene regulatory network of seed pigmentation in B. juncea, transcriptomes in seed coat of a yellow-seeded inbred line and its brown-seeded near- isogenic line were sequenced using the next-generation sequencing platform Illumina/Solexa and de novo assembled. Over 116 million high-quality reads were assembled into 69,605 unigenes, of which about 71.5% (49,758 unigenes) were aligned to Nr protein database with a cut-off E-value of 10−5. RPKM analysis showed that the brown-seeded testa up-regulated 802 unigenes and down-regulated 502 unigenes as compared to the yellow-seeded one. Biological pathway analysis revealed the involvement of forty six unigenes in flavonoid biosynthesis. The unigenes encoding dihydroflavonol reductase (DFR), leucoantho-cyanidin dioxygenase (LDOX) and anthocyanidin reductase (ANR) for late flavonoid biosynthesis were not expressed at all or at a very low level in the yellow-seeded testa, which implied that these genes for PAs biosynthesis be associated with seed color of B. juncea, as confirmed by qRT-PCR analysis of these genes. To our knowledge, it is the first time to sequence the transcriptome of seed coat in Brassica juncea. The unigene sequences obtained in this study will not only lay the foundations for insight into the molecular mechanisms underlying seed pigmentation in B.juncea, but also provide the basis for further genomics research on this species or its allies.


Introduction
Brassica juncea (L.) Czern & Coss is grown in China, India, Canada, Australia and Russia as an oilseed, condiment or vegetable crop [1,2,3]. This species produces black, brown or yellow seed. The varieties distinct in seed color have many differences in agronomic traits such as oil and protein content [4,5], seed dormancy [6], seed germination [7]. The seed pigments of the Brassica species are predominantly secondary flavonoid metabolites, i.e., flavonols and proanthocyanidins (PAs) [8,9]. Flavonols accumulate in cotyledons and give seeds a yellow cue while PAs, also known as condensed tannins, deposit only in seed coat and bring about browning through oxidation [10]. PAs, oligomers of (2)-epicatechin, are synthesized in the late stage of flavonoid biosynthesis from dihydroflavonols by catalysis of dihydroflavonol 4-reductase (DFR), leucoanthocyanidin reductase (LODX) and anthocyanidin reductase (ANR) [11]. The genes encoding these enzymes and many other genes participating in flavonoid biosynthesis have been cloned from a dozen of plant species including Arabidopsis [11], grape [12], soybean [13]. In Arabidopsis, mutants of the genes for flavonoid biosynthesis produce transparent testa (tt) and yellow or pale yellow seed. However, only few genes regulating flavonoid biosynthesis have been found in Brassica species, close allies of Arabidopsis. For example, in B. napus, DFR is found to be associated with seed pigmentation by comparison of DFR activity and PA accumulation in seed coat between a yellow-seeded line and its related black-seeded counterpart [14]. In B. juncea, DFR and LODX are confirmed to participate in seed pigmentation by RT-PCR analysis for gene expression in seed coat between a yellow-seeded line and its brown-seeded near-isogenic lines (NILs) [15,16]. In B. rapa, the TRANSPARENT TESTA GLABRA 1 (TTG1) [17] and the TRANSPARENT TESTA 8 (TT8) [18] encoding a WD-40 regulatory protein and a basic Helix-Loop-Helix transcription factor respectively, are suspected to be involved in seed pigmentation. These results suggest that Brassica species synthesize PAs and control the final seed color in a manner analogous to Arabidopsis. However, the molecular mechanism underlying seed color is not fully understood in Brassica. It is also worth noting that the allotetraploid B. juncea and B. napus and even the mesopolyploid B. rapa have much larger and more complex genome than Arabidopsis. As such, for a single-copy gene in Arabidopsis, there may be multiple copies in Brassica species. For instance, ANR [19] and TT16 [20] each has four copies, and DFR two copies [14] in B. napus. To date, only a few flavonoid biosynthetic genes have been completely cloned from Brassica species [21].
Recently, next-generation sequencing (NGS) technologies have emerged as a revolutionary and high throughput approach that accelerates the genome sequencing and gene function research [22,23]. NGS-based RNA-seq can provide massive sequences with enormous depth and coverage to easily discover novel gene, splice junctions, fusion transcripts and to make a comparison of gene expression between samples or tissues [22,23]. It has been applied to analysis of transcriptomes of dozens of plant species including B. napus [24]. However, RNA-seq has not been used to study B. juncea seed coat up to now.
In this study, two cDNA libraries generated from RNA of seed coat of a Brassica juncea yellow-seeded inbred line and its brownseeded near-isogenic line were sequenced using Illumina/Solexa platform and de novo assembled into unigenes. The expression level of unigenes was compared between two libraries on the basis of read per exon kilobase per million (RPKM) analysis. 1,304 unigenes were found to be differentially expressed by at least 2fold. Forty six unigenes were identified for flavonoid biosynthesis by KEGG analysis. The PA biosynthetic genes such as DFRs, LDOXs and ANRs were found to be significantly down-regulated in the yellow-seeded testa as compared with its black-seeded counterpart, as confirmed by qRT-PCR analysis. These results imply the involvement of PA biosynthetic genes in seed pigmentation in B. juncea.

Results and Discussion
Seed coat transcriptome sequencing and de novo assembly Total RNA was extracted from seed coat of the Brassica juncea yellow-seeded inbred line Sichuan Yellow Seed (SY) and its brown-seeded near-isogenic line A (NILA, BC 8 F 5 ) 15 days after pollination (DAP) [25]. The phenotypes of the mature seeds of these lines are shown in Figure 1. A cDNA library was constructed for SY and NILA seed coat, respectively. These cDNA libraries were sequenced by using Illumina HiSeq TM 2000, generating 60,371,476 and 57,465,530 reads of 100-bp in length, respectively. After removing adaptor sequences, lowquality and ambiguous reads, 59,735,444 and 56,423,676 highquality reads were obtained. An overview of the sequencing and assembly is listed in Table 1. All the raw transcriptome data have been deposited at the sequence read archive (SRA) of the National Center for Biotechnology Information (NCBI). By using the software CLC Genomic Workbench 4.9 (CLC Bio, Denmark), a total of 116,159,120 high-quality reads were assembled into 99,096 contigs, with a minimum contig size of 200 bp, a maximum size of 10,232 bp, an N50 of 1,060 bp and an average contig length of 664 bp. By performing the pair-end joining and gap filling, a total of 79,520 scaffolds were produced, with a minimum scaffold size of 200 bp, a maximum size of 12,040 bp, an N50 of 1,284 bp and an average length of 859 bp. The size distribution of the scaffolds is shown in Figure 2A. By using the software CD-HIT (V.4.5.4) [26], the scaffolds were further assembled into 69,605 unigenes, with a minimum scaffold size of 200 bp, a maximum size of 12,040 bp, an N50 of 1,307 bp and an average length of 868 bp. The size distribution of the unigenes is shown in Figure 2B.

Functional annotation
For functional annotation of B.juncea seed coat transcriptome, the unigene sequences were blasted against NCBI non redundant (Nr) protein database using a cut-off E-value of 10 25 . A total of 49,758 unigenes (71.5% of all the assembled unigenes) were aligned to Nr protein database. Among these aligned unigenes, 62.01% had an E-value of less than 1.0E 250 and showed very strong homology while the remaining 37.99% had an E-value of between 1.0E 25 to 1.0E 250 ( Figure 3A). The similarity distribution showed that 50.06% of these aligned unigenes had a similarity higher than 90%, 46.21% between 60% and 90% and only 3.72% lower than 60% ( Figure 3B). For species distribution, approximately 96.49% of these aligned unigenes were matched with sequences from 6 top-hit species, i.e., Arabidopsis lyrata (46.59%), A. thaliana (40.59%), Thellungiella halophila (3.37%), B. rapa (2.38%), B. napus (1.87%), and B. oleracea (1.69%), which all fall into Brassicaceae. The 20 top-hit species based on Nr annotation are shown in Figure 3C.

Gene Ontology (GO) classification
GO classification based on sequence homology revealed that 19,618 out of all the assembled unigenes were categorized into 37 functional groups ( Figure 4). The three major categories (biological process, cellular component, and molecular function) were assigned 31,026, 22,918 and 26,267 GO terms, respectively ( Figure 4). In the 'biological process' category, the unigenes related to 'metabolic processes' (58.85%) and 'cellular processes' (55.05%) were predominant while in the 'cellular component' category, 'cell parts' (42.05%) and 'cell' (42.05%) were found to be the most abundant class. Under the 'molecular function' category, the majority of unigenes were involved in 'binding' (63.28%) and 'catalytic activities' (55.09%).

KEGG classification
Analysis of 69,605 unigenes through Kyoto Encyclopedia of Genes and Genomes (KEGG) [27] showed that 14,998 unigenes were assigned to 258 pathways (Table S1). The major pathways containing hundreds of unigenes were metabolic pathways (3,506 unigenes, accounting for 23.37%), biosynthesis of secondary metabolites (1,785, 11.9%), microbial metabolism in diverse environments (802, 5.35%), RNA degradation (538, 3.59%), and ribosome (535, 3.57%). We concentrated on the 'biosynthesis of secondary metabolites' pathway in relation to seed pigmentation and found 154 unigenes for 'Phenylpropanoid biosynthesis', 114 for 'Phenylalanine, tyrosine and tryptophan biosynthesis', 46 for 'Flavonoid biosynthesis' and 9 for 'Flavone and flavonol biosynthesis' in B.juncea seed coat.    Identification of transcription factors in the seed coat transcriptome All the assembled unigenes were aligned against the AGRIS (Arabidopsis Gene Regulatory Information Server) database by Blastx with E value of below 10 25 and identity of over 70%. A total of 2,347 unigenes were indentified to belong to forty eight putative transcription factors (TF) families (Table S2). For MYB and bHLH families, two major TF families whose members regulate flavonoid biosynthesis in plants [10,12,28], 100 and 190 unigenes were identified, respectively.

Transcripts differentially expressed between the yellowand brown-seeded testa
To investigate the expression level of unigenes in the yellowand brown-seeded testa, the number of clean reads was compared between the libraries for each of 69,605 assembled unigenes through RPKM analysis [29]. 1,304 unigenes were found to be differentially expressed between the yellow-and brown-seeded testa, among which 802 unigenes were up-regulated while 502 down-regulated in the brown-seeded testa as compared to the yellow-seeded counterpart ( Figure 5; Table S3). Among these unigenes, 170 (12.8%) showed over 15-fold change in expression level as well as 471 (36.4%) showed 2,3-fold change. The fold change distribution of unigenes differentially expressed between the testae was shown in Figure 6. Annotation of differentially expressed unigenes revealed that 455 unigenes belonged to 28 GO groups while the remaining 849 unigenes could not to be classified (Figure 7).

Genes from the seed coat transcriptome involved in the flavonoid biosynthetic pathway
In Arabidopsis, PA synthesis initiates in the endothelial cells of seed coat at 3 DAP although expression of the genes for PA biosynthesis gradually decreases from 3 DAP onwards [30]. In B. juncea, PA accumulation does not occur in seed coat until 10 DAP [31]. Array analysis showed that six flavonoid genes (chalcone synthase (CHS), flavonoid 3-hydroxylase (F3H), flavanol (quercetin) 3-Omethyltransferase (FOMT), DFR, glutathione S-transferase (GST), and TTG1) had higher and two (flavonoid 39-hydroxylase (F39H) and flavonol synthase (FLS)) had lower expression in the developing siliques (22 DAP) of the B. carinata brown-seeded accession than in those of the yellow-seeded one [32]. Comparison between secondary-wall-enriched seed coat and developing hypocotyls showed that the transcripts of flavonoid biosynthetic genes ANR, FLS and CHS were more abundant in the middle fraction of seed coat than in the developing hypocotyls of B. napus [33]. This mean that flavonoid biosynthetic genes are expressed in seed coat at a higher level, which is consistent with sole deposition of PAs in seed coat [8]. In this study, we used the seed coat for analysis, which eliminates interference from the silique wall or even the embryo. As mentioned above, forty six unigenes were found to be involved in flavonoid biosynthesis of B. juncea seed coat. Among them, six unigenes for CHS, and both two unigenes for chalcone isomerase (CHI) were up-regulated and one unigene for FLS down-regulated in the brown-seeded testa as compared to the yellow-seeded one( Figure 8; Table S4). Previous RT-PCR analysis and activity assay of three enzymes DFR, LDOX and ANR for PA biosynthesis have shown that DFR and LDOX are not expressed in seed coat of Brassica yellow seeds [14,15,34]. Here, we found no DFR unigenes in the yellow-seed testa, but three DFR unigenes expressing at a high level in the brown-seeded testa with 10,510, 9,644 and 5,203 reads, respectively (Table S4). Similarly, three ANR unigenes were almost not expressed in the yellow-seed testa, although expressed highly in the brown-seeded testa. In addition, two LDOX unigenes were expressed at a much higher level in the brown-seeded testa than in the yellow-seeded one. Taken together, these results suggest that the genes for PA biosynthesis be not expressed or at a very low level and therefore no PAs accumulated in the yellow-seeded testa of B.juncea.   Real-time PCR analysis of the genes involved in the flavonoid biosynthesis pathway To confirm the difference in expression level between the accessions found in RPKM analysis, eight unigenes for flavonoid biosynthesis were chosen for qRT-PCR analysis (Figure 8). These unigenes were Unigene_920 (CHS), Unigene_29246 (CHI), Uni-gene_7597 (DFR), Unigene_7701 (LDOX), and Unigene_16036 (ANR) which were up-regulated in the brown-seeded testa, Unigene_28310 (FLS) down-regulated, Unigene_682 (F3H) and Unigene_396 (F39H) unchanged. qRT-PCR data confirmed expression pattern of these unigenes by RPKM analysis (Figure 9, Table S4).

Conclusion
To the best of our knowledge, this study is the first to apply the Illumina/Solexa platform to investigating the sequences and transcript abundances of genes expressed in seed coats of Brassica juncea. This transcriptome analysis has provided a total of 69,605 unigenes among which 71.5% are aligned to Nr database although there is no Brassica juncea reference genome sequence of available. Comparison in expression level between the yellow-and the brown-seeded testa totally identified 1,304 differentially expressed unigenes including the key PA biosynthetic genes such as DFRs, LDOXs and ANRs. In sharp contrast to high expression in the brown-seeded testa, expression of the PA biosynthetic genes and accumulation of PAs are inhibited in yellow-seeded testa, which suggests that PAs play a major role in seed pigmentation of B. juncea. The unigene sequences obtained in this study will not only lay the foundations for insight into the molecular mechanism underlying seed pigmentation in B.juncea, but also provide the basis for further genomics research on this species or its allies.

Plant materials
A yellow-seed inbred line Sichuan Yellow Seed (SY) and its brown-seeded near-isogenic line A (NILA) (Figure 1

Construction of cDNA library for sequencing
Total RNA was extracted from seed coats using TRIzol Reagent (Life technologies, US) following the manufacturer's instructions and checked for a RIN number to inspect RNA integration by an Agilent Bioanalyzer 2100 (Agilent technologies, US). Qualified total RNA was purified by RNeasy micro kit (QIAGEN, Germany) and RNase-Free DNase Set (QIAGEN, Germany). Poly(A) mRNA was isolated from qualified total RNA using Oligotex mRNA mini kit(QIAGEN, Germany) and then broken into short fragments which were used as templates for synthesis of the first-and the second-strand cDNA. Two paired-end libraries were synthesized by using the Genomic Sample Prep kit (Illumina, US) per the manufacturer's instructions. Short fragments were purified with the Qubit TM dsDNA HS Assay kit (Invitrogen, US) and connected with different sequencing adapters. Each of the two libraries had an average insert size of 400 bp and was sequenced by Shanghai Biotechnology Corporation (Shanghai, China) using Illumina HiSeq TM 2000. The transcriptome datasets are available at the NCBI Sequence Read Archive (SRA), under the accession number SRA072246.

De novo assembly
Pre-processing and assembling of the raw sequence data were carried out by using CLC Genomics Workbench (version 4.9). This pre-processing included the removal of the adapter sequences, low quality (Q20,20) sequences, shorter-than-15nucleotide sequences and ambiguous inner regions. For de novo assembly, contig scaffolding algorithm was adopted. The parameters were set as follow: length fraction (i.e. minimum length fraction of a read that must match the reference sequence) of 0.5, similarity ratio (i.e. minimum similarity of the match length of a read) of 0.8 and the minimum contig length (minimum length of an assembled contig to be reported) of 200 bp or longer. Then, the scaffolds were clustered by using CD-HIT (V.4.5.4) with 90% similarity cut-off to produce distinct unigenes.

Functional annotation
Function of unigenes was annotated by Blastx against the NCBI non-redundant (Nr) database or the AGRIS database (http://arabidopsis.med.ohio-state.edu/downloads html) [35] with E-value threshold of 10 25 . Base on Nr annotation, tophit species were identified. Gene ontology (GO) classification was obtained by WEGO [36] (http://wego.genomics.org.cn/ cgi-bin/wego/index.pl) via GO id annotated by Perl and R program. The unigenes sequences were also aligned to the COG database to predict and classify functions. KEGG pathways were assigned to the unigene sequences using the singledirectional best hit (SBH) method on the KEGG Automatic Annotation Server (KAAS) online [37].

Expression analysis
The RPKM method was used to calculate the unigene expression in this study. The clean reads of each library were mapped to the sequences of each unigene. The significance of difference in gene expression between the yellow-and brownseed testa was determined by using DEGseq, an R package [38]. False discovery rate (FDR) was applied to identify the threshold of the P value in multiple tests [39]. When FDR is less than 0.05 and log 2 ratio greater than 1 (two-fold change) between the accessions the unigenes were considered as differentially expressed.

Real-time quantitative RT-PCR(qRT-PCR) analysis
Eight unigenes involved in flavonoid synthesis were chosen for validation by qRT-PCR. The primers designed with the software Primer Premier 5.0 were listed in Table S5. Total RNA was extracted from seed coat tissues 15 DAP using TRIzol RNA extraction method (Tiangen, China) and reverse transcribed into cDNA using PrimeScritH RT reagent kit with gDNA Eraser (Perfect Real Time) (Takara, China). The qRT-PCR was performed with a Bio-Rad CFX-96 RealTime PCR System (Bio-Rad, US) in a final volume of 20 ml containing 2 ml cDNA, 10 ml 26SYBR premix Ex taqTM (Takara, Japan), 0.4 ml 10 mM the forward and reverse primers each, and 7.2 ml RNase-free water. The thermal cycling conditions were as follows: 95uC 5 min, 45 cycles at 95uC for 5 s for denaturation and 56uC for 25 s for annealing and extension. TIPS-41 [40] was used as an internal control for normalization to make a comparison of gene expression level between the accessions.