RNA-seq transcriptome analysis of the immature seeds of two Brassica napus lines with extremely different thousand-seed weight to identify the candidate genes related to seed weight

Brassica napus is an important oilseed crop worldwide. Although seed weight is the main determinant of seed yield, few studies have focused on the molecular mechanisms that regulate seed weight in B. napus. In this study, the immature seeds of G-42 and 7–9, two B. napus doubled haploid (DH) lines with extremely different thousand-seed weight (TSW), were selected for a transcriptome analysis to determine the regulatory mechanisms underlying seed weight at the whole gene expression level and to identify candidate genes related to seed weight. A total of 2,251 new genes and 2,205 differentially expressed genes (DEGs) were obtained via RNA-seq (RNA sequencing). Among these genes, 1,747 (77.61%) new genes and 2020 (91.61%) DEGs were successfully annotated. Of these DEGs, 1,118 were up-regulated and 1,087 were down-regulated in the large-seed line. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database analysis indicated that 15 DEGs were involved in ubiquitin-mediated proteolysis and proteasome pathways, which might participate in regulating seed weight. The Gene Ontology (GO) database indicated that 222 DEGs were associated with the biological process or molecular function categories related to seed weight, such as cell division, cell size and cell cycle regulation, seed development, nutrient reservoir activity, and proteasome-mediated ubiquitin-dependent protein catabolic processes. Moreover, 50 DEGs encoding key enzymes or proteins were identified that likely participate in regulating seed weight. A DEG (GSBRNA2T00037121001) identified by the transcriptome analysis was also previously identified in a quantitative trait locus (QTL) region for seed weight via SLAF-seq (Specific Locus Amplified Fragment sequencing). Finally, the expression of 10 DEGs with putative roles in seed weight and the expression of the DEG GSBRNA2T00037121001 were confirmed by a quantitative real-time reverse transcription PCR (qRT-PCR) analysis, and the results were consistent with the RNA sequencing data. This work has provided new insights on the molecular mechanisms underlying seed weight-related biosynthesis and has laid a solid foundation for further improvements to the seed yield of oil crops.


Introduction
Brassica napus is an important oilseed crop that is used as an edible oil and as an animal feed worldwide. In recent years, this crop has been popularly used as an ornamental plant and renewable energy source in China [1]. In addition to silique number per plant and seed number per silique, seed weight is one of the three major components of yield for oilseed rape [2]. Increasing the seed weight is the primary approach to improve the yield of B. napus [2]. Only on chromosome A09, 17 QTL were identified for seed weight [3], which showed the genetic complexity of seed weight regulation. Therefore, understanding the regulatory mechanisms underlying gene expression and selecting the appropriate candidate genes for seed weight are of great significance for yield improvement in oilseed rape breeding.
However, few reports have focused on the molecular mechanisms that regulate the differences of seed weight in B. napus, which may be related to complicated quantitative characteristics or environment [4]. Recently, the genome sequence of B. napus has been reported, and it can be utilized to study the regulatory mechanisms underlying seed weight [2]. Several seed weight (size) genes have been successfully cloned, such as those in rice and Arabidopsis thaliana. In B. napus, Liu et al. [5] reported that variations in the auxin-response factor 18 (ARF18) genes on chromosome A9 could regulate the seed weight without changing the pod, and it represents the first polyploidy crop yield gene acquired by map-based cloning. The molecular mechanism underlying yield genes has frequently been studied in rice, and the results provide rich information for our study on the seed weight of B. napus. Many genes related to grain weight and size have been reported in rice, including GW2, GW5, GL3, and GS5. GW2 and GW5 have similar functions in regulating grain weight. Song et al. [6] revealed that GW2 encodes a ring-type E3 ubiquitin ligase. Loss of GW2 function regulates cell division and increases the cell number, enlarges the spikelet hull, accelerates the grain milk-filling rate, and results in enhanced grain width, weight and yield. Weng et al. [7] found that GW5 most likely regulates cell division during grain development through the ubiquitin-proteasome pathway. There are three key enzymes required for the ubiquitin-proteasome pathway: E1, E2 and E3 [8]. Ubiquitin is initially activated by E1, and then E2 and E3 work in conjunction to recognize the substrate protein and conjugate it to ubiquitin. Ubiquitin is then easily attached as a previously synthesized chain or a monomer. Finally, the ubiquitinated protein is transported to the proteasome for degradation [9]. Qi et al. [10] noted that GL3.1 encodes Ser/Thr phosphatase, a member of the protein phosphatase kelch (PPKL) family, and found that GL3.1 accelerates cell division by influencing protein phosphorylation in the spikelet, thereby resulting in longer grains and higher yields in rice. Further studies showed that GL3.1 directly dephosphorylates its substrate Cyclin-T1;3. The down-regulation of Cyclin-T1;3 in rice results in a shorter grain, which occurs through a novel phosphatase-mediated process during cell cycle progression. GS5 encodes a putative serine carboxypeptidase that acts as a positive regulator of cell division. The high expression of GS5 promotes and accelerates the cell cycle, and it then accelerates cell division of the outer glume and ultimately enlarges the grain size and increases the grain weight [11]. In A. thaliana, the cloned genes primarily regulated cell division, although they also controlled the cell number or cell size of the integument, embryo, and endosperm and affected the final seed weight or seed size. Du et al. [12] reported that SUPPRESSOR2 OF DA1 (SOD2) encodes ubiquitin-specific protease15 (UBP15), which regulates seed size by promoting cell division in the integuments of ovules and developing seeds. DA1 and DA2 encode a ubiquitin receptor and a RING-type protein with E3 ubiquitin ligase activity, respectively, which determine the final seed weight and organ size by restricting the period of cell proliferation [13,14]. The ARF2 gene links auxin signalling, cell division, and the final size of seeds and other organs [15]. Jofuku et al. [16] revealed that APETALA2 (AP2) plays an important role in determining the seed size, seed weight and seed oil and protein accumulation, and it acts via the maternal sporophyte and endosperm genomes to control seed weight and seed yield. Previous research on seed weight has provided helpful information for our work identifying candidate genes for seed weight via transcriptome analyses.
Next-generation sequencing (NGS) technology has been widely used in biological research. Because the genomes of a large number of valuable species have been published, NGS technology has rapidly developed [17]. The transcriptome is the set of all messenger RNA and noncoding RNA molecules in one cell or a population of cells for a specific development stage or physiological condition. Transcriptome analysis is an important method of studying and exploring functional genes in plants. Compared with genome analyses, transcriptome analyses only assess transcribed genes; thus, they have a smaller research scope, which may produce more pertinent results [18]. In recent years, RNA-seq technology has rapidly developed into the most important method for transcriptome profiling [19]. Plant transcriptome analyses could provide considerable information on highly expressed genes, differentially expressed genes and new genes [20][21][22]. Transcriptome sequencing has been widely applied in studies on Brassica, such as transcriptome profile analysis of young floral buds of fertile and sterile plants [23], transcriptome profiling of resistance to pathogen [24] and research to narrow down the number of candidate genes in identified quantitative trait locus (QTL) regions [25].
In this study, two transcriptomes were compared for the immature seeds of two B. napus lines with extremely different thousand-seed weight (TSW). A total of 2,251 new genes and 2,205 (differentially expressed genes) DEGs were obtained. A bioinformatics analysis was performed to investigate those differentially expressed genes related to seed weight. A DEG (GSBRNA2T00037121001) involved in the identified QTL region for seed weight is an important candidate gene for further seed weight research. These databases will provide an important resource and helpful insights for identifying candidate genes related to seed weight and improving the final yield of B. napus.

Comparison of the seed weight, seed size and fresh seed weight at different seed stages between two extreme lines
The results showed that the seed size and seed weight of the large-seed line DH-G-42 were significantly greater than those of the small-seed line DH-7-9 (Fig 1, Table 1). The statistical data for the fresh seed weight revealed that there were few differences between the large-seed line and the smallseed line during 1 and 2 weeks after flowering (WAF). However, from 3 to 6 weeks after flowering, the fresh weight of DH-G-42 was obviously higher than that of DH-7-9 (Table 1). Thus, the 3 rd week might be the key period in which the two lines differentiate. In B. napus, from 20 days after flowering to maturity was the fastest growing period for seed development, which supported our results.

Transcriptome sequencing
After the construction and sequencing of two cDNA libraries, a total of 9.06 G and 7.16 G nucleotides were generated from T3 (large-seed) and T4 (small-seed) libraries (named by the Sequencing Co.), respectively ( Table 2). After removing the adaptor sequences, low quality and short reads, 7.16 G and 6.88 G data were obtained for large seed line and small seed line, respectively. The N content was 0% in both libraries. The Q30 percentage exceeded 90%, and the GC (guanine-cytosine) content was 46.57% and 47.38% for the large-seed and small-seed lines, respectively, which suggests that the sequencing data were highly accurate and reliable. The total reads of large (70,844,224) and small seed line (67,971,356) were blasted with the reference genome of B. napus. On average, 82.04% and 82.75% of the reads were successfully mapped to the reference genome by Tophat (v2.0.12) ( Table 3).

Analysis of new genes and differentially expressed genes (DEGs)
After filtering the genes that only contained one exon or encoded short peptide chains (less than 50 amino acid residues), a total of 2,251 new genes and 97,963 genes were revealed by blasting the reference genome using Cufflinks. The sequences and detailed information are listed in S1 and S2 Tables.
To identify the DEGs between the large-seed and small-seed samples, the Fragments Per Kilobase of exon per Million fragments mapped (FPKM) was used to calculate the gene expression levels. The results showed that a total of 2,205 genes were differentially expressed between two libraries, of which, 1,118 genes were up-regulated and 1,087 genes were down-regulated in T3 compared with T4. Detailed information is listed in S3 Table. The up-regulated and downregulated genes between T3 and T4 were revealed by a hierarchical clustering analysis. The results of the hierarchical clustering analysis based on the FPKM values (Fig 2). It was found that about 29% DEGs with the log 2 FPKM value was lower than 2, 63% DEGs with the log 2 FPKM value was between 2 and 6, 8% DEGs had the log 2 FPKM value between 6 and 10, and one DEG (GSBRNA2T00002721001) owned the log 2 FPKM value up to 11.89. In Fig 2, all gene expression levels seemed to be low, that is because we set the reference level of DEGs' log2FPKM value (10) was much higher than normal (6) between T3 and T4. Furthermore, there were too many genes expressed during the seed development, a higher reference level facilitated to screen the differentially expressed genes between T3 and T4.

Functional annotations
To functionally annotate the B. napus seed transcriptome, the 2,251 new genes and 2,205 DEGs were applied in a blast search of the GO (Gene Ontology), COG (Cluster of Orthologous Group), KEGG (Kyoto Encyclopedia of Genes and Genomes), Swiss-Prot and NR (non-redundant) databases using BLAST (Basic Local Alignment Search Tool) software, which successfully annotated 1,747 new genes and 2,020 DEGs. All of the annotated information is listed in Table 4 and S4 Table. Of these new genes, 1,733, 1,069, 1,321, 336 and 276 were annotated in the GO, COG, KEGG, Swiss-Prot and Nr databases, respectively. For the GO classification analysis of DEGs, 1,763 genes were assigned to three main GO functional categories and then divided into 57 sub-categories (Fig 3). Several DEGs were assigned to more than one sub-category. We then calculated the percentage of DEGs involved in each sub-category (S5 Table). The sub-category with largest percentage in the "cellular component" category was cell part (DEGs accounted for 89.17% of all genes involved in this category), followed by cell (88.94%) and organelle (80.77%). The sub-category with the largest percentage in the "molecular function" category was binding (52.98%) and then catalytic activity (44.36%). The sub-category with the largest percentage in the "biological process" category was cellular process (76.52%), followed by metabolic process (71.47%), response to stimulus (55.42%) and biological regulation (47.87%). These results indicated that the primary cause resulting in different seed weight might be related to the differential expression of genes in these nine sub-categories, which would provide a direction for further analysis.
The COG functional classification analysis showed that the DEGs were distributed across 25 COG categories (Fig 4). We also calculated the percentage of DEGs in each category (S6 Table) and found that the largest percentage was "General function prediction only" (20.69%), followed by "Transcription" (12.24%), "Posttranslational modification, protein turnover, chaperones" (12.09%) and "Replication, recombination and repair" (10.47%). Many genes were differentially expressed in these categories, which might explain the different seed weights in the large-seed and small-seed lines.
Information on the metabolic pathways of the transcriptome is valuable for understanding the physiological processes of seed development after flowering. To evaluate the differences between the two lines, the metabolic pathways of the DEGs were analyzed by classification. An analysis using the KEGG database on biological pathways showed that a total of 493 genes were assigned to 97 pathways (Table 4; S7 Table). However, the number of genes involved in these 97 pathways was 505 instead of 493, which suggests that certain genes might be involved in more than one KEGG pathway, such as the gene "Brassica_newGene_5407", which is involved in cysteine and methionine metabolism; glycine, serine and threonine metabolism; and lysine biosynthesis. Among the 97 pathways, the largest pathway was ribosome, which contained 31 genes, followed by spliceosome (19), protein processing in endoplasmic reticulum (18), oxidative phosphorylation (15) and plant hormone signal transduction (14). Approximately 91% (2,015) and 66% of the DEGs (1,451) were successfully annotated using the NCBI's NR database and the Swiss-Prot database, respectively ( Table 4).

Identification of seed weight-related genes in the B. napus transcriptome and QTL region
Information on the seed weight (seed size) genes in rice, A. thaliana and B. napus were used to study the pathways, biological processes, key enzymes and proteins related to cell division, cell cycle, ubiquitin-proteasome, auxin response factor, Ser/Thr phosphatase, cyclin, and serine carboxypeptidase. The KEGG pathway analysis identified 6 and 9 DEGs for the ubiquitinmediated proteolysis (ko04120) (Fig 5A) and proteasome (ko03050) (Fig 5B) pathways, respectively ( Table 5). Analysis of the related enzymes and proteins of seed weight in the Swiss-Prot and NR databases (  were annotated as encoding the key enzymes and proteins of ubiquitin-protease biological processes. Two genes encoded cell cycle-related enzymes and protein cyclin-dependent kinases and cyclin, and one up-regulated gene encoded the auxin response factor. Four genes (1 upregulated and 3 down-regulated) encoded serine carboxypeptidase, and 6 genes (4 up-regulated and 2 down-regulated) encoded serine/threonine-protein phosphatase and its regulatory subunit or isozyme. The putative pathways of the above DEGs involved in seed weight formation were shown in Fig 6.  On chromosome A09 of B. napus, a hot QTL region of~0.58 Mb between nucleotides 25,401,885 and 25,985,931 was identified to contain 91 candidate genes associated with seed weight according to SLAF-seq in our previous studies [2]. The information for these 91 candidate genes is presented in S8 Table. A comparison of the 91 candidate genes obtained by SLAF-seq and the 2,205 DEGs in the B. napus seed transcriptome identified one gene (GSBRNA2T00037121001) that is involved in inorganic ion transport and metabolism and encodes a ferritin according to the annotated information. Identification of the DEGs in QTL regions can substantially narrow down the number of candidate genes. The annotation of all these genes in further analyses will provide important information.

Quantitative real-time reverse transcription PCR (qRT-PCR) analysis of DEGs
To verify the expression level of DEGs detected by RNA-seq, 10 DEGs that were predicted to encode the key enzymes or proteins related to seed weight and the DEG GSBRNA2T00037121001 in the QTL region of TSW were selected for qRT-PCR validation. Among these 10 randomly selected DEGs, 5 genes were up-regulated (higher expression levels in T3 than T4), whereas the  remaining 5 genes were down-regulated (lower expression levels in T3 than T4) (Fig 7). The DEG in the QTL region was still up-regulated as shown by the transcriptome analysis. The data obtained by the qRT-PCR was consistent with the RNA-seq results, which suggests the reliability of the transcriptome database.

Conclusions
In this study, RNA-seq technology was used to reveal the immature seed transcriptome of two extremely different lines (large-seed line and small-seed line) in B. napus. A total of 2,251 new genes and 2,205 DEGs were identified. According to the analysis of annotated information, the identified candidate genes related to seed weight mainly encoded the key enzymes or proteins. A DEG (GSBRNA2T00037121001) that was previously identified in the QTL region for seed weight might be an important candidate gene for further seed weight research in B. napus. Several important DEGs were analyzed by qRT-PCR to verify the RNA-seq results. This work has provided new insights into the molecular mechanisms underlying seed weight-related biosynthesis and has laid a solid foundation for further improvements to the seed yield of oil crops. Further annotation of those new genes and DEGs will also provide rich information and a reference transcriptome for future genome-wide gene expression research in B. napus.

Plant materials
In 2005, a large-seeded and a small-seeded B. napus germplasms were collected. And a few progeny lines from cross hybridization between these two germplasms were identified with heavier thousand-seed weight than large-seed parent, and some progeny lines had lighter TSW than small-seed parent. Then two progeny lines with extremely different TSW were selected to perform microspore culture to obtain their DH lines, named DH-G

Phenotypic observations
The seeds of three plants from the two lines were respectively harvested to measure the seed weight and seed size at the yielding time. The samples were oven dried at 80˚C to constant weight. In this experiment, the thousand seed weight of each line was evaluated as the means of three replicated weights of 1000 seeds. The seed size was measured from the mean values of the diameter of three seeds of each line, with three replications. The fresh weight of immature seeds at each time point from 1 to 6 weeks after flowering were determined by the mean values of 100 seeds, with three replicates of each line.

RNA extraction, cDNA library construction and sequencing
Total RNA was isolated from the seeds at six different stages from 1 to 6 weeks after flowering using The mRNA-seq libraries were prepared using the Illumina TruSeq RNA Sample Preparation Kit (Illumina Inc., San Diego, CA, USA), and the mRNA isolation, fragment interruption and RNA-seq were performed by the Biomarker Technologies Corporation according to their standard protocol. Ultimately, the libraries were constructed for sequencing using the Illumina HiSeqTM 2500 sequencing platform (Illumina Inc., San Diego, CA, USA) at Biomarker Technologies Corporation in Beijing. The adaptor sequences, empty reads and low quality sequences were filtered from the raw reads. Real-time monitoring was performed for each cycle during sequencing and the ratio of high-quality reads with quality scores greater than Q30 (indicating a quality score of 30 and a 0.1% chance of an error; 99.9% confidence) for the raw reads and guanine-cytosine (GC) content was calculated for quality control.

Identification of new genes and differentially expressed genes (DEGs)
Tophat (v2.0.12) is a fast mapping tool for RNA-seq reads that can identify splice junctions between exons [26]. Cufflinks can be used to assemble the reads into transcripts based on the mapping results [27]. The software programs Tophat and Cufflinks were used to blast the sequencing reads with the reference genome of B. napus (1.2 Gb, download link: http://www. genoscope.cns.fr/brassicanapus/data/ [28]) for the differential gene and transcript expression analyses. This process could reveal new genes that have not been previously annotated using reference genomes and could also evaluate the abundance of gene expression. The Fragments Per Kilobase of exon per Million fragments mapped (FPKM) method was used to calculate the abundance of gene expression. DESeq [29] is suitable for analysing biological duplicate samples obtained from DEG screening, and EBSeq [30] is suitable for non-biological duplicate samples. During the DEG screening, a false discovery rate (FDR) <0.001 and fold change> = 8 were considered standard values. Here, the FDR was obtained via a P-value correction and considered the key index of the differential gene expression analysis.
Fold-change formula: Fold change ¼ log 2 ðnormalized expression level of DEG in T4=T3Þ First, the fold change and FDR were calculated from the normalized expression. If the DEG fold change is > = 8, then a FDR <0.001 indicates that the DEG was significantly different between T3 and T4.

Functional annotation
Gene ontology (GO) is the de facto standard for gene functionality descriptions, and it is widely used in functional annotations and enrichment analyses [31]. The Cluster of Orthologous Groups of proteins (COG) database is based on the phylogenetic relationships among bacteria, algae and eukaryotes. Gene products in an orthologous relationship can be classified using the COG database [32]. In an organism, different genes coordinate to exert biological functions. Pathway analyses are helpful for further interpreting gene functions. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database is the main public database on pathways [33]. In this study, a functional enrichment analysis of new genes and differentially expressed genes was conducted by performing a blast search against the GO, COG, KEGG pathway, Swiss-Prot [34] and Non-redundant protein (NR) [35] databases using BLAST software [36].