Transcriptome Sequencing of Mung Bean (Vigna radiate L.) Genes and the Identification of EST-SSR Markers

Mung bean (Vigna radiate (L.) Wilczek) is an important traditional food legume crop, with high economic and nutritional value. It is widely grown in China and other Asian countries. Despite its importance, genomic information is currently unavailable for this crop plant species or some of its close relatives in the Vigna genus. In this study, more than 103 million high quality cDNA sequence reads were obtained from mung bean using Illumina paired-end sequencing technology. The processed reads were assembled into 48,693 unigenes with an average length of 874 bp. Of these unigenes, 25,820 (53.0%) and 23,235 (47.7%) showed significant similarity to proteins in the NCBI non-redundant protein and nucleotide sequence databases, respectively. Furthermore, 19,242 (39.5%) could be classified into gene ontology categories, 18,316 (37.6%) into Swiss-Prot categories and 10,918 (22.4%) into KOG database categories (E-value < 1.0E-5). A total of 6,585 (8.3%) were mapped onto 244 pathways using the Kyoto Encyclopedia of Genes and Genome (KEGG) pathway database. Among the unigenes, 10,053 sequences contained a unique simple sequence repeat (SSR), and 2,303 sequences contained more than one SSR together in the same expressed sequence tag (EST). A total of 13,134 EST-SSRs were identified as potential molecular markers, with mono-nucleotide A/T repeats being the most abundant motif class and G/C repeats being rare. In this SSR analysis, we found five main repeat motifs: AG/CT (30.8%), GAA/TTC (12.6%), AAAT/ATTT (6.8%), AAAAT/ATTTT (6.2%) and AAAAAT/ATTTTT (1.9%). A total of 200 SSR loci were randomly selected for validation by PCR amplification as EST-SSR markers. Of these, 66 marker primer pairs produced reproducible amplicons that were polymorphic among 31 mung bean accessions selected from diverse geographical locations. The large number of SSR-containing sequences found in this study will be valuable for the construction of a high-resolution genetic linkage maps, association or comparative mapping and genetic analyses of various Vigna species.

Introduction 31 accessions, 8 genotypes were used for initial screening and validation of marker polymorphism. These mung bean accessions were obtained from the National Center for Crop Germplasm Resources Preservation located in Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing, China and were grown for DNA and RNA extractions in a research field at the same location.

RNA extraction
Tissue samples of roots, stems and leaves were collected at 15 days after sowing and quickly frozen in liquid nitrogen for storage at -80°C. RNA from each of the samples was isolated using the Trizol Reagent with manufacturer's instructions (Invitrogen, Life Technologies, Carlsbad, USA). Total RNA was then treated with RNase-free DNase I (Takara, Kyoto, Japan) for 30 min at 37°C to remove residual DNA. RNA quality was verified using a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA) and was also checked by RNase free agarose gel electrophoresis. The concentration of the total RNA was further quantified with a RNA NanoDrop (Thermo Fisher Scientific Inc., Waltham, MA, USA).

cDNA library construction
Aliquots of 20 μg each of total RNA from the two different mung bean cultivars were separately processed for cNDA library construction. In both cases, a concentration of 400 ng/μl, OD260/280 = 1.8~2.2, RNA 28S:18S 1.0, and RNA Integrity Number (RIN) 7.0 was used for the preparation of a cDNA libraries. Poly-T oligonucleotide labeled magnetic beads (Illumina Inc., San Diego USA) was used to isolate poly (A) mRNA from the total RNA. Subsequently, the isolated mRNA was purified and fragmented into smaller pieces (200-700 nt) using divalent cations at 94°C for 5 min. First strand cDNA was synthesized with SuperScript II reverse transcriptase and random primers using the small fragment RNAs as templates. Second-strand cDNA synthesis was carried out using GEX second strand buffer, dNTPs, RNase H and DNA polymerase I. The cDNA fragments were further processed with end repair and phosphorylation using T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase. The repaired cDNA fragments were 3' adenylated using Klenow enzyme (Exo-) before end-ligating with Illumina paired-end adapters. The products from this ligation reaction were electrophoresed on a 2% (w/v) TAE-agarose gel and purified to select templates of different sizes for downstream enrichment. Only cDNA fragments of 200 bp (±25 bp) were excised from the gel and subjected to PCR. Thermocycling enrichment consisted of 15 cycles of PCR amplification performed using PCR primers PE1.0 and PE2.0 with Phusion DNA Polymerase.
Illumina Sequencing, data filtering and de novo assembly The new cDNA libraries of the two mung bean cultivars were sequenced with Illumina pairedend sequencing technology [23] and an Illumina Hiseq 2000 sequencer which automatically collected the data and generated FASTQ files (.fq) containing raw data for all the reads. The files for ZL1 and V6, based on cultivar, were submitted to the sequence read archive (SRA) database at Genbank (www.ncbi.nlm.nih.gov), where they were combined and given the accession number SRP043316. The raw data was stringently filtered for preliminary assembly. All reads with more than 10% of bases with a poor quality score (Q<20), or non-coding RNA (such as rRNA, tRNA and miRNA), as well as ambiguous sequences containing an excess of "N" nucleotide calls or adaptor contamination, were removed. We also discarded the reads that did not pass the Illumina failed-chastity filter according to the relation "failed-chastity 1", with a chastity threshold of 0.6 on the first 25 cycles. After this, de novo transcriptome assembly was performed with the software Trinity [24] by uploading the high-quality reads onto a computer for further analysis to 1) reduce the graph complexity by resolving repetitive sequences shorter than the read length in the graph; 2) clip the short tips in the graphs with lengths less than 2K (58 bp); 3) filter the low-coverage links that appeared only once along with their related edges; 4) merge the detected bubbles into a single path if the sequences of the parallel paths had a difference of fewer than four base pairs with >90% identity. After all these steps, the connections on the simplified graphs were broken at any repeat boundaries. These bioinformatics processes resulted in sequences without redundancy that contained the least amount of "N" nucleotide calls un-extended on either end. Only these stringently compiled sequences were defined as unigenes.

Unigene annotation and classification
The annotation of unigenes was performed using various bioinformatics procedures. The unigenes were aligned with BLASTX to four protein databases (NCBI non-redundant or Nr proteins, Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes or KEGG and euKaryotic Ortholog Groups or KOG) and one nucleotide database (NCBI nucleotide or Nt sequences) with an E-value threshold of 1.0E-5 for all except KOG with a threshold of 1.0E-3 [25,26]. The proteins with highest sequence similarity were retrieved and annotated to each unigene. With nucleotide based annotation, Blast2GO [27] software was used to obtain GO annotation categories defined by molecular function, cellular component and biological process ontologies. The KOG database was used to predict possible functions while pathway assignments were determined with KEGG.

EST-SSR search and primer design
The MIcroSAtellite (MISA) search engine (http://pgrc.ipk-gatersleben.de/misa) was employed for SSR mining and identification. The minimum numbers of repeats used for selecting the EST-SSRs were ten for mono-nucleotide based loci, six for di-nucleotide loci, five for tri-nucleotide loci and three for all larger repeat types (tetra-to hexa-nucleotide motifs). SSR marker primer pairs were designed based on sequences flanking the selected microsatellite loci using the software package Premier 5.0 (PREMIER Biosoft International, Palo Alto, CA) with targeted sizes of PCR products in the range between 100 to 300 bp.

Marker validation and genomic DNA extraction
Validation of the EST-SSR markers was conducted with the 31 mung bean accessions mentioned previously. Genomic DNA was extracted from young leaves of these accessions using the Hexadecyl trimethyl ammonium Bromide (CTAB) extraction method [28]. DNA quality was evaluated on a 1.0% agarose gel electrophoresis. The working concentration of DNA was adjusted to 50 ng/ml for use in marker evaluations. Amplification was performed in 20 μl volume reactions containing 0.5 U of Taq DNA polymerase, 1 × PCR Buffer II, 1.5 mM MgCl2, 25 μM of dNTP, 0.4 μM of each primer, and 50 ng of genomic DNA. Microsatellite loci were amplified on a Heijingang Thermal Cycler (Eastwin, Beijing, China). PCR amplification was carried out with the following cycling conditions: one cycle of 4 min at 94°C, 30-35 cycles at 94°C for 30 s, 55-60°C for 30 s and 72°C for 30 s. The final extension was performed at 72°C for 10 min. The PCR products were analyzed by 8.0% non-denaturing PAGE (Polyacrylamide gel electrophoresis) using silver staining. Fragment sizes were estimated based on the 1 Kb size marker as a DNA ladder (Promega, Madison, WI, USA).

Genetic similarity analysis
A distance tree was built based on a genetic similarity matrix for the 31 mung bean accessions and branch support was estimated with 10,000 bootstraps. The number of alleles (N a ), observed heterozygosities (H o ) and polymorphism information content (PIC) for each of the EST-SSR markers were calculated using the software POPGEN 1.32 [29]. The cluster analysis of genotypes was carried out based on Nei's unbiased measures of genetic distance by using the unweighted pair-group method with arithmetic average (UPGMA) and coefficients of genetic similarity for the mung bean accessions calculated using the same program [29].

Sequencing and de novo assembly of Illumina paired-end reads
A total of 52.7 and 51.7 million paired-end raw reads were generated in Illumina next generation sequencing runs for the ZL1 and V6 varieties, respectively. After removal of the low quality reads, 51.9 and 50.9 million clean reads remained, with GC content of 43.1% and 44.8% in ZL1 and V6, respectively. In terms of sequence quality, ZL1 and V6 had 98.4% and 98.2% of Q20 or above bases and 94.5% and 93.9% of Q30 or above bases, respectively.
The combined sequence length of the Illumina reads was 10.3 Gb and could be assembled de novo into 48,693 unigenes and 83,542 individual transcripts. The average length of the assembled transcripts was 1,194 bp (N50 = 1,936 bp), which was longer than the average length of the assembled unigenes (874 bp, N50 = 1,563 bp). The range in length of the assembled unigenes was from 200 bp to 20,214 bp. A total of 25,590 unigenes (52.6%) were short, with lengths no longer than 500 bp. The next two size classifications of unigenes were of similar frequency with 9,141 unigenes (18.8%) having 501 to 1,000 bp in length, and 8,643 unigenes (17.8%) with lengths ranging from 1,000 to 2,000 bp. Finally 5,319 unigenes (10.9%) were longer than 2,000 bp (Fig 1).

Sequence annotation
For validation and annotation of the sequence assembly contigs and unique singletons, all unigenes were searched against the five databases as described earlier. A total of 48,693 unigenes provide a significant BLAST result, with 25,820 (53.0%) showing significant similarity to known proteins in the Nr sequence database, with 18,316 from Swiss-Prot (37.6%) and 17,652 from PFAM (36.3%), only 4,064 unigenes were annotated in all databases but 28,613 could be annotated in at least one while the rest (20,080 unigenes) were not annotated to the existing databases.

Development of polymorphic EST-SSR markers in mung bean
A total of 13,134 in silico EST-SSR markers could be developed form the 10,053 SSR containing sequences using Primer3 (S3 Dataset). A subset of 200 markers was randomly chosen from these loci to validate EST-SSR marker usefulness in monitoring polymorphisms for eight mung bean accessions (S4 Dataset). Of the markers tested, 129 primer pairs (65.0%) produced clear PCR amplicons of the expected sizes, 36 markers amplified non-specific products, and 35 did not amplify any clear DNA bands. Of the successfully amplifying EST-SSR markers, 66 (or 51.2%) were polymorphic and consisted of 4 mono-, 2 di-, 33 tri-, 6 tetra-, 3 penta-and 18 hexa motif based marker (Table 2) while the other 97 markers were monomorphic. An average of 3.0, 2.5, 2.2, 2.2, 2.0 and 2.6 alleles were generated for the mono-, di-, tri-, tetra-, penta-and hexa motif markers, respectively.

Gene functions of the unigene sequences containing polymorphic EST-SSRs
To determine the possible functions of the 66 validated EST-SSRs, they were subjected to BLASTn analysis with a non-redundant database of legume sequences. The results showed that most of the sequences were similar to known or hypothetical protein-encoding genes from common bean (Phaseolus vulgaris L.) and soybean (Glycine max L.) with a lesser proportion homologous to cowpea (Vigna unguiculata L. [Walp]) genes (S5 Dataset). Among the positive hits were genes for auxin efflux carrier component, dof zinc finger, F-box, gibberellin receptor, helicase, mitogen-activated and leucine-rich repeat extensin-like proteins as examples.

Phylogenetic analysis of the cultivated mung bean accessions
The 66 polymorphic EST-SSR markers developed in this study were used to assess the genetic diversity of 31 mung bean accessions from the complete geographic distribution of the crop for which a total of 154 alleles were detected and scored. The number of alleles per marker ranged from 2 to 5. Effective number of alleles per locus (N e ) varied from 1.074 (for marker MB64504)  (Table 3). Phylogenetic relationships between the accessions grouped the 31 accessions into two main clusters in a dendogram (Fig 3). Cluster 1 was comprised of accessions from Southeast and South Asian countries such as Thailand, Vietnam, Philippines, Indonesia, Myanmar, Nepal and India. Cluster 2 was comprised of accessions from East Asian and Northeast Asian countries such as China, Japan, Korea and Russia. Results indicated that geographical distances between collection sites for the accessions were associated with Nei's genetic distances between accessions.

Discussion
Transcriptome sequencing and de novo assembly has proven to be an important tool for gene discovery in many organisms and an effective method for molecular marker development [30,31]. Our results also proved that the short reads from Illumina paired-end sequencing of mung bean cDNAs can be easily assembled and used for transcriptome analysis, marker development and gene identification even without a reference genome for the crop. The marker validation confirmed previous evaluations of SSRs in common bean [8,32] and other legume crops, where EST-SSR markers detect moderate polymorphism.
Our work complements previous analysis with 454 sequencing of two mung bean cDNA libraries which resulted in the discovery of 1,630 and 1,334 EST-SSR primer pairs from the leaves of Jangan and Sunhwa varieties, respectively [16]. Here we concentrated on the use of Illumina sequencing to develop a larger total number of in silico EST-SSR markers to increase the number of SSRs available for mung bean. We found that the EST-SSR marker validation rate was similar to the success rate of SSR development in a previous study in mung bean [33] but slightly lower than when using sequenced BAC end sequences or small-insert genomic libraries in common bean [34,35]. Comparing these legumes, the polymorphism ratio of EST-SSR markers in mung bean was slightly higher than for EST-SSR in common bean [18].
In terms of the types of motifs found in SSR loci other than the mono-and large sized repeats, we found similar results as in previous work with plant microsatellites [33]. For example, the proportions of di-and tri-nucleotide repeats were quite close (13.8% versus 14.6%) as was found in previous results [30] The relative abundance of di-and tri-nucleotide repeats in ESTs sequences has been observed in many other legumes including common bean [8], cowpea [36] and chickpea [37]. The most common tri-nucleotide repeats found in the mung bean varieties studied here were GAA/TTC followed by TCT/AGA and CTT/AAG, which are similar with previous reports in mung bean [17] and common bean [8,18], possibly indicating a shared origin among the Phaseoleae tribe. As in common bean, AG/CT motif was the most abundant repeat motif (30.8%), followed by AT/TA (28.5%) contrasting slightly with other legumes outside the Phaseoleae tribe [36][37][38] but similar to estimates in common bean [8].
A larger number of repeat units were generally correlated with greater allelic variability for an SSR locus. Therefore, the shorter motif loci such as those with mono-and di-nucleotides repeats usually had to possess more repeats to be of equivalent polymorphism to longer motif repeats such as those with tri-nucleotide motifs. Previous studies in legumes have mainly focused on di-, tri-, and tetra-nucleotide SSRs [18], whereas mono-nucleotide SSRs perhaps have not drawn enough attention for marker development. We found that mono-nucleotide SSRs had higher polymorphism rates than previously thought, followed by tetra-nucleotide SSRs, justifying their inclusion in future SSR evaluations.
To determine the level of polymorphism of our new EST-SSR markers, we validated 200 loci, of which 129 markers (65.0%) produced successful amplicons, which is in between previously reported success rates of 21.0% [15] and 78.0% [17] in mung bean. The failure of 35 primer pairs to generate amplicons may be due to long intervening introns which would not allow successful genomic amplification of markers based on transcribed mRNA based sequences. Alternatively the location of primers across splice sites or regions of poor sequence quality could explain non-amplification.
Despite these issues, about two thirds of the EST-SSR markers were successful, suggesting that the transcriptome sequencing was accurate and the assembled unigenes were of high quality. In terms of allele detection, only half of the successfully amplified SSRs produced more than one allele and most had no more than 4 alleles, which was in agreement with a previous study [39] for mung bean. PIC values in this study were in line with previously reported values for mung bean SSRs [17,40]. They were also similar to EST-SSRs from adzuki bean which can The number of observed alleles. 2 The number of effective number of alleles. 3 The number of expected heterozygosity. 4 The number of observed heterozygosity. 5 Shannon's Information index (Lewontin, 1972). 6 Polymorphic information content. doi:10.1371/journal.pone.0120273.t003

Development of EST-SSR Markers in Mung Bean
amplify products in mung bean [41]. One advantage of EST-SSR markers, is that they may detect valuable genetic diversity possibly associated with traits of interest for breeding because of their location in genes.

Development of EST-SSR Markers in Mung Bean
In summary, the accomplishments of our study were 1) the detection of a large number of unigenes for mung bean and 2) the discovery of over 10,000 SSR containing sequences in the transcriptome of the crop. We observed that the number and lengths of unigenes in mung bean compared favorably to previous analyses with the generation here of approximately 25 million paired-end reads for the transcriptome which assembled into over 48 thousand unigenes with an average length of over 850 bp. Sanger sequencing of cDNAs do not efficiently produce this number of unigenes or sufficient overall contig lengths because of a limitation in the depth of sequencing, even when full-length cDNA libraries are used [31,42,43]. The use of transcriptomic data for in silico microsatellite development was shown to be promising and we were able to increase the number of possible EST-SSRs tenfold compared to previous studies [33]. The newly developed SSR sequences and EST-SSR markers we made will be important resources for basic research and together with SNP resources can significantly enhance the ability to find closely linked markers for traits of interest in the molecular breeding of mung bean.