Transcriptome Analysis of Leaf Tissue of Raphanus sativus by RNA Sequencing

Raphanus sativus is not only a popular edible vegetable but also an important source of medicinal compounds. However, the paucity of knowledge about the transcriptome of R. sativus greatly impedes better understanding of the functional genomics and medicinal potential of R. sativus. In this study, the transcriptome sequencing of leaf tissues in R. sativus was performed for the first time. Approximately 22 million clean reads were generated and used for transcriptome assembly. The generated unigenes were subsequently annotated against gene ontology (GO) database. KEGG analysis further revealed two important pathways in the bolting stage of R.sativus including spliceosome assembly and alkaloid synthesis. In addition, a total of 6,295 simple sequence repeats (SSRs) with various motifs were identified in the unigene library of R. sativus. Finally, four unigenes of R. sativus were selected for alignment with their homologs from other plants, and phylogenetic trees for each of the genes were constructed. Taken together, this study will provide a platform to facilitate gene discovery and advance functional genomic research of R. sativus.


Introduction
The Brassicaceae include many vegetables with great economic value, one of the most familiar of which is the root vegetable radish (Raphanus sativus). Widely cultivated in the Far East, including China, Korea and Japan, R. sativus contains many varieties with different sizes, colors, and planting times depending on their intended use. Besides being an important edible crop, R. sativus is also an important source of medicinal compounds [1]. For example, the roots of R. sativus contain multiple peroxidases that could be used for a range of medicinal applications [2]. Recently, high-throughput sequencing was used to reveal miRNA and transcriptome profiles of R. sativus root, which will expand the understanding of R. sativus miRNA function in multiple biological processes [3,4]. Similar to Brassica rapa, the genome of R. sativus is relatively small and diploid (2n = 18). However, given that only one tissue (root) transcriptome profiling of R. sativus has been reported, the very limited availability of useful genomic information for R. sativus has greatly impeded progress in improving our understanding of the functional genomics and medicinal potential of R. sativus.
mRNA-Seq technology has emerged as an effective transcriptome analysis tool that is now widely used in biological research, clinical field, and drug development. mRNA-Seq provides a more sensitive and accurate way of analyzing transcriptome profiles than microarray and other technologies [5][6][7]. For instance, mRNA-Seq can systematically analyze transcriptional activity at the single-nucleotide level [8]. Accordingly, mRNA-Seq is widely used not only to detect transcriptome profiles but also to identify differentially expressed genes, discover unknown transcripts, and accurately identify cSNP (Coding Single Nucleotide Polymorphism) in multiple non-model plants [9][10][11][12][13][14][15][16][17][18]. More importantly, mRNA-seq does not rely on the availability of any genomic information for the species being studied. Therefore, mRNA-Seq has been widely used for transcriptome analysis of non-model organisms without genome information and to quantify the expression of genes of interest [19][20][21][22][23][24][25][26]. The de novo assembly of transcriptome data facilitates the functional genomic analysis of species for which a genome sequence is not available, and can also be used for genome annotation and the identification of microsatellite or SNP markers [27][28][29][30][31][32].
In this study, we first applied mRNA-Seq technology to characterize the transcriptome profiles of leaf tissues in R. sativus. The Trinity program [19] was used for the de novo assembly of the R. sativus transcriptome, and a total of 28,410 unigenes were generated. All of the unigenes were annotated and analyzed by using the BLAST algorithm (http:// www.ncbi.nlm.nih.gov/blast/Blast.cgi) to identify the unigene sequences through comparisons with various protein and nucleotide databases. Importantly, KEGG analysis showed that spliceosome and alkaloid biosynthesis were the important pathways in the bolting stage of R. sativus leaf tissues. Furthermore, approximately 6,000 SSRs with various motifs were developed in the assembled unigene library. Finally, we aligned four R. sativus unigenes with their homologs from other species within the Brassicaceae, and used the homology data to build phylogenetic trees using the neighbor-joining method.
Taken together, the de novo assembly and further annotation and analysis of transcriptome will provide comprehensive genome resources that will promote further functional genomic analysis of R. sativus and its closest relatives.

Plant material and RNA isolation
The R. sativus big root radish was provided by Prof. Jiangsheng Wu (Huazhong Agricultural university, China), which was cultivated in the experiment field of Huazhong University of Science and Technology in 2012. The leaves were harvested after it had grown for three months and stored at -80°C. Total RNA was isolated by using TRIzol (Invitrogen) according to the manufacturer's instructions, and DNase I (Promega) was used to remove contaminating genomic DNA. Total RNA was quantified by using a NanoDrop spectrophotometer (Thermo Fisher Scientific, Inc.), and the purity of the total RNA was detected by measuring both the A260/280 and A260/230 ratios. Furthermore, Agilent 2100 Bioanalyzer (Agilent Technologies, Inc.) was utilized to assess the integrity of the RNA samples.The purified RNA was dissolved in RNase-free water and stored in a -80°C freezer until subsequent analysis.

cDNA preparation and sequencing
The TruSeqTM RNA Sample Preparation Kit (Illumina, Inc.) was used to construct a cDNA library for sequencing of the R. sativus transcriptome. In brief, poly-A mRNA was purified from 10 μg of total RNA using oligo (dT) magnetic beads (NEB). The purified poly-A mRNA was then fragmented into fragments 200-500 bp in sizeThe fragmented mRNA pieces were used for the synthesis of first-strand cDNA with hexamer primers and reverse transcriptase (Promega), and then DNA polymerase I and RNase H were used to synthesize the second-strand cDNA. The cDNA fragments were then purified, end-repaired, A-tailed, and ligated to index adapters (Illumina). The ligated products were subsequently PCR-amplified to generate the final cDNA libraries. The cDNA library was sequenced using the Illumina GA IIX sequencing platform and the average length of sequenced reads was 75 nt.

De novo transcriptome assembly
After sequencing of the cDNA library, base-calling using Illumina Pipeline Software was used to transform the raw image data generated into sequence information. The clean reads were obtained and deposited in NCBI Sequence Read Archive (SRA) Sequence Database with accession number SRP022926 by trimming adaptor sequences and removing empty reads and ambiguous nucleotides ('N' in the end of the reads). The Trinity program [19] was further used to assemble the clean reads to generate non-redundant unigenes. Briefly, reads with overlaps were assembled to generate contigs, which were joined into scaffolds that were further assembled through gap filling to generate sequences called unigenes. There is no single absolutely optimal k-mer length for transcriptome assembly because transcriptome coverage is highly variable owing to differential gene expression in cells. Here, a default kmer size of 25 was set for the de novo assembly of the R. sativus transcriptome. Default values were used for all other parameters, and the assembled unigenes used for further annotation were all at least 200 bp in length.

Transcriptome annotation
All of the assembled unigenes were analyzed using the BLAST algorithm to search for homologs in the NCBI nonredundant protein (Nr), Swissprot, Pfam and Trembl databases. The BLASTX algorithm was used to identify homologous sequences with an E-value cut-off of 10 -5 . Based on the BLAST hits identified by interrogation of the Nr and Swiss-Prot databases, GO (Gene Ontology) annotation was performed using BLAST2GO [33] to obtain cellular component, molecular function, and biological process terms. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database is used extensively to reveal molecular interaction network and metabolic pathways [34]. KEGG pathways annotation was performed by mapping the sequences obtained from BLAST2GO to the contents of the KEGG metabolic pathway database (http://david.abcc.ncifcrf.gov/).

RT-PCR validation of unigenes
Total RNA from R. sativus leaves was reverse-transcribed by using SuperScript III Reverse Transcriptase (Invitrogen) and oligo(dT) 18 . Five assembled unigenes were randomly selected for RT-PCR validation. Forward (Fwd) and reverse (Rev) primers were designed using Primer3. The sequences of primers used for validation of assembled R. sativus unigenes were also listed in Table S1.

SSR prediction
Simple sequence repeats (SSRs) provide the most popular molecular-marker method used in quantitative trait loci (QTL) exploration, genetic analysis, and high-throughput genome mapping [35][36][37]. Mining of the SSRs present in the assembled unigenes of R. sativus was performed as described previously [32]. Briefly, five categories of SSRs, including those with dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide motifs, were classified. For each category of SSR, the minimum number of contiguous repeat units is five. Statistical analysis was used to investigate the number of SSRs with each type of motif and the distribution of the lengths of repeat units.

Phylogenetic analysis of unigenes
The R. sativus unigenes identified after transcriptome assembly were first aligned with the same genes of Brassicaceae plants retrieved from Genbank after interrogation of the database using CLUSTAL X version 2.0 [38]. Subsequently, based on the nucleotide sequence alignments, MEGA4.0 software was used to generate phylogenetic trees of R. sativus unigenes and their homologs from other plants, using the neighbor-joining approach [39]. The phylogenetic trees were furnished with 1,000 pseudoreplicate bootstrap values at each node. The accession numbers for the sequences of U2AF35, GIGANTEA, EMB2369 and ATIMD2 unigenes that were included in the tree were listed as follows: R.

mRNA-Seq and de novo transcriptome assembly
High-throughput sequencing technology has been used extensively to explore the transcriptome profiles of many nonmodel plant species. Here, we report the use of Solexa highthroughput sequencing technology to determine transcriptome profiles of R. sativus. A cDNA library of mRNAs from R. sativus was sequenced using the Illumina GA II X instrument, and a total of 27,976,916 raw reads were produced. A total of 21,978,870 clean reads were generated after removing adaptor sequences and low-quality reads. All clean reads were mapped to the B. rapa genome using Bowtie program. The result revealed that the clean reads were mainly distributed in the CDS (coding sequence) region ( Figure 1A), which demonstrated that the mRNA-Seq results were highly reliable. The Trinity program was then used for de novo assembly of all clean reads, with generation of a total of 28,410 unigenes with an average length of 394 bp and an N 50 of 422 bp ( Figure 1B and Table 1). Among the unigenes, the shortest and longest unigenes are 200 bp and 3,392 bp, respectively. Moreover, 19,107 unigenes were within the 200-400 bp range, and 8,499 unigenes were within the 400-1,000 bp range. The observation that only 804 of the unigenes were longer than 1,000 bp might be explained by the documented roles of ubiquitous alternative splicing events and repeats element in perturbing de novo assembly of longer transcripts [40]. To evaluate the accuracy of de novo assembly of R. sativus, the clean reads obtained using mRNA-Seq were mapped to the unigenes. The results showed that 68.2% (14,994,188 clean reads) of clean reads could map to the assembled unigenes. Among 14, 994,188 mapped reads, 11,818,198 (78.8%) reads uniquely matched to the unigenes and 3,175,990 (21.2%) mapped to multiple locations on unigenes. We randomly selected five assembled unigenes of R.sativus for RT-PCR validation. The RT-PCR data indicated that all five selected unigenes of R.sativus got right amplifications ( Figure S1). In addition, given that both B. rapa and R. sativus belong to the Brassicaceae family, we assume that the R. sativus geneome contains similar number of genes as B. rapa, which has 41,174 protein coding genes with an average length of 2,015 bp [41]. In order to estimate the level of transcripts coverage of R.sativus, indirect evaluation was performed by comparing the number of unigenes (28,410) and the average length (394 bp) to the B. rapa genes. The assembly generated in this study was estimated able to cover 13.5% of the R. sativus transcriptome.
Meanwhile, the sequencing data of root of R.sativus was downloaded from NCBI Sequence Read Archive and assembled using Trinity program. In order to investigate the tissue specificity of gene expression between root and leaf of R.sativus, we further performed a comparison of assembled unigenes between root and leaf. The result showed that a total of 50,700 unigenes of root of R.sativus were generated with an average length of 1,182 bp (Table S3). Furthermore, BLASTx analysis revealed that ~9 2% (26,225) leaf unigenes of R.sativus were perfectly matched to root unigenes of R.sativus, which indicated that the transcriptome profiles of root and leaf of R.sativus are very similar, whereas ~8 % differential unigenes between root and leaf might be associated with tissue-specific gene expression.
Taking into account the results of this transcriptome assembly derived solely from one tissue of R. sativus (leaf) and that there is currently no available genome sequence information for R. sativus, our data suggest the suitability of mRNA-Seq for extensive and accurate assembly of plant transcriptomes.

Annotation of the R. sativus transcriptome
In order to assess and annotate the assembled unigenes, the 28,410 unigenes generated by Trinity were subjected to BLASTx similarity analysis (E-value cutoff of 10 -5 ) involving interrogation of public protein databases, including the NCBI non-redundant protein (Nr) database, the Swiss-Prot protein database, Nt database, Pfam database and Trembl database. As a result, the mapping rate was between 54% and 91% ( Table 2). Most of the unigenes (approximately 91%) were mapped to the Nt library. Importantly, mapping of 80% of the unigenes to the Nr library suggests that most of the unigenes can be translated into proteins. The mapping rates of unigenes against the Swissprot and Trembl protein databases were 54.8% and 80.3%, respectively. Distribution analysis based on BLASTx searches showed that the unigenes of R. sativus have homologs in numerous hit a lot of plant species (Figure 2A). Among the various plant species that have protein sequence information in GenBank, the unigenes of R. sativus had the highest number of hits to sequences from Arabidopsis lyrata (47.12%), followed by Arabidopsis thaliana (35.01%), Eutrema halophilum (5.27%), Brassica napus (2.84%), Brassica rapa   Figure 2B). The high similarity of R. sativus unigenes to genes from A. lyrata and A. thaliana suggests the possibility of using the genome of A. lyrata or A. thaliana as a reference for identifying differential gene expression patterns of mRNA-Seq data.

KEGG analysis of leaf unigenes of R. sativus
The KEGG database documents connections between known information on molecular interaction networks, such as information about metabolic pathways and the functions of gene products. To better understand the biological functions of R. sativus unigenes, the assembled R. sativus unigenes were further assigned to the biochemical pathways in the KEGG database (34). A total of 7158 unigenes were assigned to 117 KEGG biochemical pathways (Table S4). The top 3 pathways (ranked by p-value) included "spliceosome" (159 unigenes, pvalue 5.12×10 -8 ), "biosynthesis of alkaloids derived from histidine and purines" (254 unigenes, p-value 2.36×10 -4 ), "Pentose phosphate pathway" (79 unigenes, p-value 7.5×10 -3 ). As shown in Figure 4, multiple R.sativus unigenes (green rectangle) were involved in the process of spliceosome assembly. Some made up the key components of spliceosome assembly including U1, U2, U4, U5 and U6 etc. Some other unigenes, such as Prp5, Prp2, Prp16, Prp17, Prp18, Prp22, Slu7, Prp22 and Prp43, directly participated in the process of spliceosome assembly. As R.sativus in bolting stage grows very rapid, this result showed that versatile alternative splicing events may occur in the bolting stage of R.sativus, which suggested that alternative splicing regulation is an general approach to affect complex plant biological processes including plant development, disease resistance and stress responses etc. In the meanwhile, KEGG analysis showed that 254 unigenes of R.sativus were involved in the pathway of alkaloid biosynthesis ( Figure S2). Given that alkaloids are crucial for defense against pathogens, this finding confirmed that R. sativus is an important source of medicinal compounds, which also suggested that alkaloid biosynthesis is a crucial physiological activity in the bolting stage of R.sativus. In the meanwhile, we performed the KEGG analysis of top 10 GO terms. Interestingly, the result showed that "alkaloid biosynthesis" and "Pentose phosphate pathway" were still major pathways and 95 unigenes were assigned to "Spliceosome" pathway (Table S5). Collectively, KEGG analysis of R.sativus unigenes shed light on the great potential of mRNA-Seq technology to identify novel genes in various plant metabolic pathways.

SSR analysis of the R. sativus transcriptome
Simple Sequence Repeat (SSR) markers, also known as microsatellites, are repeating DNA sequences of 2-6 base pairs, which are widely used as molecular markers for genetic mapping and to analyze species diversity. Here, 6,295 SSRs were predicted in 2,755 unigenes, of which 2,048 unigenes contained multiple SSRs. The SSRs included 2,919 (46.4%) dinucleotide motifs, 3,137 (49.8%) trinucleotide motifs, 229 (3.6%) tetranucleotide motifs, 7 (0.1%) pentanucleotide motifs, and 3 (0.05%) hexanucleotide motifs (Table 3). Furthermore, 112 motifs were detected among all of the SSRs identified. The most abundant SSR motif was GA/AG/TC/CT (2,153 SSRs), followed by AAG/AGA/TTC/TCT/CTT/GAA/GAG (1,052 SSRs), and then AT/TA/TG/GT (572 SSRs). The distribution of the number of repeat units in all SSRs was also investigated ( Table  3). The result revealed that most SSRs contained fewer than 10 repeat sequences, and no SSRs with more than 20 repeat sequences were observed. Although many SSR markers were identified in the Brassicaceae family, only a few SSR markers were reported in R. sativus [42][43][44][45]. Here, we predicted 6,295 SSRs from the assembled unigene library of R. sativus. These SSRs will likely be of value for genetic analyses of R. sativus and other related non-model plants.

Phylogenetic analysis of interesting genes in R. sativus
Some important genes that are widely involved in the processes of plant growth and development were further picked up for phylogeny analysis. KEGG analysis of the R. sativus unigenes revealed that many were involved in the spliceosome pathway (Table S4). U2AF35 is a component of the U2 snRNP Auxiliary Factor (U2AF), which is required for the recruitment of U2 snRNP to pre-mRNAs during spliceosome assembly [46]. Furthermore, U2AF35 is widely involved in many physiological processes in plants, including mitosis, photoperiodism, and flowering. Therefore, we aligned the U2AF35 unigene from the R. sativus transcriptome assembly with U2AF35 genes from multiple plants for which U2AF35 sequences were available in Genbank. Sequence alignment revealed that the U2AF35 gene was highly conserved among members of the Brassicaceae. A phylogenetic tree constructed on the basis of the sequence alignments demonstrated that the U2AF35 gene family can be divided into several groups in higher plants according to their nucleotide sequences. The sequence of the U2AF35 gene from R. sativus displays high homology to the U2AF35 sequences from B. rapa and B. oleracea. The U2AF35 gene can be roughly classified into four groups in higher plants, and the gene sequence identified in the transcriptome of R. sativus existed in the Brassica group. There are several species that can be found in the Brassica group such as B. napus,   GIGANTEA unigenes ( Figure S3 and Figure S4). Taken

Conclusions
Next-generation sequencing enabled de novo assembly of the transcriptome of R. sativus, an edible root vegetable that belongs to the Brassicaceae family. The assembly program Trinity generated 28,410 unigenes, which open the way for further research on gene expression patterns, functional genomics, and proteomics in R. sativus. Our characterization of the leaf transcriptome of R. sativus has not only enriched the publicly available database of sequences for members of the Brassicaceae, but will also facilitate genetic analysis of other non-model organisms. Figure S1. Expression validation of R.sativus unigenes. Five R. sativus unigenes were selected to perform RT-PCR assay. Result showed that these selected unigenes got right amplifications. (TIF) Figure S2.

Supporting Information
R. sativus unigenes involved in alkaloid biosynthesis pathway. R.sativus unigenes participating in the process of alkaloid biosynthesis derived from histidine and purine were marked with green rectangle. (TIF) Figure S3. Phylogenetic tree analysis of ATIMD2 genes in multiple plants. The phylogenetic tree of ATIMD2 unigenes in multiple plants were constructed using the neighbor-joining method. The Brassicaceae plants were denoted with red and ATIMD2 unigene of R. sativus was marked with green square. (TIF) Figure S4. Phylogenetic tree analysis of EMB2369 genes in multiple plants. The phylogenetic tree of EMB2369 unigenes in multiple plants were constructed using the neighbor-joining method. The Brassicaceae plants were denoted with red and EMB236 unigene of R. sativus was marked with green square. (TIF) Table S1.
The sequences of RT-PCR primers. The sequences of primers used for validation of R. sativus unigenes were listed in Table S1. (DOCX) Table S2. The information of four selected unigenes of R. sativus. The accession numbers and the sequences of U2AF35, GIGANTEA, EMB2369 and ATIMD2 unigenes were included in Table S2. (DOC) Table S3.
Assembly summary of root unigenes of R.sativus. The sequencing data of root of R.sativus was downloaded from NCBI Sequence Read Archive and assembled using Trinity. The assembly result was listed in Table S3. (DOCX)