De Novo Assembly and Annotation of the Chinese Chive (Allium tuberosum Rottler ex Spr.) Transcriptome Using the Illumina Platform

Chinese chive (A. tuberosum Rottler ex Spr.) is one of the most widely cultivated Allium species in China. However, minimal transcriptomic and genomic data are available to reveal its evolution and genetic diversity. In this study, de novo transcriptome sequencing was performed to produce large transcript sequences using an Illumina HiSeq 2000 instrument. We produced 51,968,882 high-quality clean reads and assembled them into 150,154 contigs. A total of 60,031 unigenes with an average length of 631 bp were identified. Of these, 36,523 unigenes were homologous to existing database sequences, 35,648 unigenes were annotated in the NCBI non-redundant (Nr) sequence database, and 23,509 unigenes were annotated in the Swiss-Prot database. A total of 26,798 unigenes were assigned to 57 Gene Ontology (GO) terms, and 13,378 unigenes were assigned to Cluster of Orthologous Group categories. Using the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database, we mapped 21,361 unigenes onto 128 pathways. Furthermore, 2,125 sequences containing simple sequence repeats (SSRs) were identified. This new dataset provides the most comprehensive resource currently available for gene expression, gene discovery, and future genomic research on Chinese chive. The sequence resources developed in this study can be used to develop molecular markers that will facilitate further genetic research on Chinese chive and related species.


Introduction
Chinese chive (A.tuberosum Rottler ex Spr.) is a perennial plant that is widely cultivated worldwide. It is commonly used as a spice in Asian cuisines, especially in China, Japan, and Korea. Chinese chive is rich in carbohydrates, proteins, mineral salts and vitamins. As a member of the Allium family, Chinese chive contains high concentrations of organic sulfur compounds, which confer characteristic flavors [1] and human health benefits [2]. Chinese chive has been used as a traditional medicine for the treatment of common colds, headaches, and cardiovascular diseases such as increased reactive oxygen species, high blood pressure, high cholesterol, platelet aggregation, and blood coagulation.
The genomes of many Allium species are very large relative to other eukaryotes; in 30 Allium species, the genome size ranges from 6860 to 30,870 Mbp per 1C. Chinese chive is a tetraploid (2n = 4x = 32) plant with a nuclear genome of 15G per 1C. Its genome is slightly smaller than the onion genome, 30 times larger than the rice genome and approximately 100 times larger than the Arabidopsis thaliana genome. Molecular markers, specific functional genes and other genomic resources in Chinese chive are very limited compared with other vegetable taxa such as the gourd and solanaceous vegetables. Transcriptome sequencing is a costeffective and frequently used strategy for the genome-wide quantification of absolute transcript levels, the development of molecular markers, and the identification of transcripts [3][4][5].
In recent years, the emergence of next generation sequencing (NGS) technology has offered a powerful and cost-efficient tool for the generation of transcriptomic datasets in non-model species using various platforms such as the Roche 454, Illumina HiSeq, and Applied Biosystems SOLiD [6,7]. RNA sequencing has been used for the genome-wide quantification of absolute transcript levels, the identification of novel genes, the delineation of transcript structure (including 5 0 and 3 0 ends, introns, and exons) and the mining of molecular markers [4,8,9]. Several non-model organisms such as the Jerusalem artichoke, Sophora japonica, and Youngia japonica have been studied by transcriptome sequencing [3,6,10], which has provided a better understanding of these plants.
In the present study, we used the Illumina HiSeq 2000 platform to develop the Chinese chive transcriptome dataset. Raw reads comprising 4.95 Gbp were assembled de novo into 53,837 unigenes. The assembled unigenes were annotated against public protein databases followed by GO, COG and KEGG classification. Moreover, 2,453 simple sequence repeats (SSRs) were identified. The transcriptome data generated in this study provide an invaluable genomic resource for future research on Chinese chive. Additionally, the SSR markers developed here should facilitate marker-assisted selective breeding, gene mapping and linkage map development in Chinese chive.

Ethics statement
All of the necessary permits for field studies were obtained. The authority responsible for A. tuberosum farming, Shandong Agricultural University, provided permission to collect the samples for our research.

Plant materials and RNA extraction
A. tuberosum seedlings were grown in the fields of the College of Horticulture Science and Engineering of Shandong Agricultural University in Tai'an, Shandong Province, China, under normal cultivation conditions. Leaves, shoots and roots were collected, and the tissues were then immediately frozen and stored in liquid nitrogen. Total RNAs were extracted using TRIzol Reagent and then treated with DNase I according to the manufacturers' instructions. The RNA quality was verified using a 2100 Bioanalyzer and gel electrophoresis, and the library was sequenced on an Illumina HiSeq 2000 platform.

Library construction and Illumina sequencing
Purification of the mRNAs was performed using the OligoTex mRNA mini kit mRNAs were chemically fragmented into short pieces using RNA Fragmentation Reagent, and cDNAs were synthesized using the mRNA fragments as templates. The cDNA fragments were purified and dissolved in EB buffer for end repair and single-nucleotide A (adenine) addition. The short fragments were then ligated with sequencing adaptors, and the products were purified and enriched by PCR to generate the final library. The library was sequenced on an Illumina HiSeq 2000 platform. Before assembly, the raw reads were filtered by removing adaptors and lowquality sequences with unknown nucleotides larger than 5% and low quality reads which the percentage of low quality bases (base quality10) is more than 20%. De novo assembly of the clean reads was conducted with the short-read assembly program "Trinity" (Release 2013-02-05) [11]. The parameters for Trinity were as follows: seqType fq, min_contig_length 100, min_glue 3, group_pairs_distance 250, path_reinforcement_distance 85, min_kmer_cov 3, SS_lib_type FR.
Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly. Inchworm assembles reads into linear transcript contigs in the following steps. Constructs a kmer dictionary from all sequence reads (in practice, k = 25); Selects the most frequent k-mer in the dictionary to seed a contig assembly (excluding both low-complexity and singleton kmers); extends the seed in each direction by finding the highest occurring k-mer with a k − 1 overlap until it cannot be extended further, then reports the linear contig; repeats the above two steps, starting with the next most abundant k-mer, until the entire k-mer dictionary has been exhausted. Next, Chrysalis clusters the Inchworm Contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptonal complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs. Finally, Butterfly processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes. The result sequences of trinity is called Unigenes.

Sequence annotation
The assembled unigenes were used for BLASTn searches and annotation against the NCBI non-redundant nucleotide sequence (Nt) database with an E-value cut-off of 10-5. BLASTx alignment (E-value <1e-5) was performed between the unigenes and the protein databases, including Nr (last updated in March of 2013), the Swiss-Prot protein (Release 2013_03), KEGG (Release 63.0), and COG database (last updated in September of 2009). With Nr annotation, we used the Blast2GO program [15] to predict GO terms related to molecular functions, cellular components, and biological processes. After obtaining GO annotations for every unigene, we used the WEGO software [16] to conduct GO functional classification for all unigenes and to understand the distribution of gene functions throughout the species at the macro level.

SSR detection and validation
SSRs were detected using the MISA program (http://pgrc.ipk-gatersleben.de/misa/). The minimum repeat number was set to six for di-nucleotides, to five for tri-and tetra-nucleotides and to four for penta-and hexa-nucleotides. Primer3 software was used to design the primer pairs. The major parameters for primer pair design were set as follows: no SSRs were present in the primer; primers aligned to unigene sequences with the 5' site were allowed 3 mismatches; primers aligned to unigene sequences with the 3' site were allowed 1 mismatch; primers that aligned to more than one unigene were removed; SSRs were validated using SSR-finder (http://www. fresnostate.edu/ssrfinder/); and both-hit primers were selected.

Results and Discussion
Illumina sequencing and de novo assembly To generate a global overview of the A. tuberosum transcriptome, sequence analysis and assembly were performed using the Illumina HiSeq 2000 sequencing platform. After stringent quality assessment and data filtering, 51,968,882 clean paired-end sequence reads (NCBI SRA accession no. SRR1020564) with total of 4,677,199,380 nucleotides (nt) were produced with an average length of 90 bps for each short read ( Table 1). The average GC content of the clean reads was 43.86%. Q20, the proportion of nucleotides with quality value larger than 20 in reads, was 97.81%.
Using the Trinity program, the obtained short-read sequences were assembled into 150,154 contigs with an average length of 289 bp and an N50 length of 444 bp. A total of 18,528 contigs, which accounted for 12.21% of the contigs, were longer than 500 bp (Fig 1). The contigs were further clustered and assembled, resulting in 60,031 unigenes, among which 10,863 genes (18.09%) were longer than 1 kb. The average length of these unigenes was 631 bp, and the N50 length was 900 bp ( Table 2). The length distributions of the contigs and unigenes are shown in Fig 1. The results suggest that the sequencing data of the Chinese chive transcriptome were effectively assembled. These results also indicate that the throughput and sequencing quality was high enough for further analysis. Because of the relatively large genome sizes of Allium species, full-genome sequencing has not been conducted in these species. Transcriptome sequencing has offered a new avenue for generating abundant sequence information from any organism [17,18]. Transcriptome sequencing has been recently applied to several Allium species. A total of 127,933 garlic unigenes with an average length of 363 bp were generated by de novo assembly [12]. A set of 42,881 unigenes with an average length of 787.30 bp were obtained from Welsh onion [19]. A total of 165,179 unigenes with an average length of 1,228.9 bp were generated from onion [20]. Kamenetsky et al [21] generated 239,116 contigs with an average length of 715 bp from garlic. Our reads roughly in the middle as compared to the average lengths of unigenes or contigs obtained from these Allium species. The data obtained from RNA-Seq analyses will provide an important basis for future gene cloning and transgenic engineering studies.

Functional annotation and classification
The unigenes were annotated by aligning them with several protein databases, including the Nr database, Nt database, Swiss-Prot, KEGG, COG, and Gene Ontology. In total, 36,523 unigenes were annotated to the six databases (S1 Table). The Annotation Rate of Chinese Chive unigenes was 60.84%, which was higher than Garlic (48.31%) and Onion (58.85%) ( Table 3). A total of 23,508 unigenes did not significantly match to any known protein in the public databases. Similar search outcomes have been observed in other studies [3,6]. These unigenes may be novel transcribed sequences in the Allium species. Some unigenes may have also been too short to allow for statistically meaningful matches. As shown in Table 3 35,648 unigene matches were found in the Nr database, 22,798 unigenes were successfully annotated in the Nt database, and 23,509 unigenes were similar to proteins in the Swiss-Prot database. The E-value distribution of the top matches showed that 80.72% of the Nr-mapped sequences had values in the range of 0−1.0 x 10 −30 , and 62.58% of the unigenes had a high Evalue score (E-value < 10 −45 ) (Fig 2A). These results reflect the validity and reliability of our de novo assembly, suggesting that the sequences have a good assembling quality. The distribution of sequence similarities showed that 88.10% of the Nr-annotated sequences had similarities greater than 40%, and 15.13% of the sequences shared more than 80% similarity with known sequences (Fig 2B). Additionally, the unigenes were compared to sequences of other plant species; 6,502 (18.2%) unigenes were best matched to sequences from Vitis vinifera, whereas 3,063 (8.6%), 2,317 (6.5%), 2,145 (6.0%), 2,016 (5.7%), 2,004 (5.6%), and 1,990 (5.6%) were matched to sequences from Oryza sativa Japonica Group, Prunus persica, Ricinus communis, Brachypodium distachyon, Populus trichocarpa, and Zea mays, respectively (Fig 2C).

SSR discovery
Using MISA software, the assembled sequences were scanned to explore SSR profiles. In total, 2,125 sequences containing 2,279 potential SSRs were identified from the 60,031 assembled sequences. The percentage (3.8%) of mined SSRs in this study was similar to those in the reports for other Lilium species and cultivars [26,27]. A total of 142 sequences contained more than one SSR, and 79 SSRs were present in compound formation. On average, the SSR frequency in the Chinese chive transcriptome was 3.80%, and one SSR could be found every 16 Table 4). The most abundant motif in the dinucleotide class was AC/GT (273, 13.56%), followed by AG/CT (231, 11.47%), AT/AT (97, 4.82%) and the least represented motif was CG/CG (10, 0.5%) ( Table 5). The dominant repeat motifs in the tri-nucleotide class was AAG/CTT (303, or 13.30%), ATC/ATG (174, or 8.64%), AGC/CTG (155, or 7.7%) and AGG/CCT (154, or 7.65%), as shown in Table 5. All of the above tri-nucleotide repeats comprised 71.47% of the characterized tri-nucleotides. For Chinese chive, SSR lengths ranged from 12 to 136 nt. The majority of tri-nucleotide repeats lengths ranged from 15 to 30  bp (data not shown). A total of 1,937 primer pairs were specifically designed from 2,125 sequences (S3 Table), which provide a good resource for molecular marker-assisted breeding.
Supporting Information S1

Author Contributions
Conceived and designed the experiments: X-DS. Performed the experiments: S-MZ L-MC X-DS. Analyzed the data: S-MZ X-DS. Contributed reagents/materials/analysis tools: S-QL X-FW. Wrote the paper: S-MZ X-DS.