Global Transcriptome Sequencing Using the Illumina Platform and the Development of EST-SSR Markers in Autotetraploid Alfalfa

Background Alfalfa is the most widely cultivated forage legume and one of the most economically valuable crops in the world. The large size and complexity of the alfalfa genome has delayed the development of genomic resources for alfalfa research. Second-generation Illumina transcriptome sequencing is an efficient method for generating a global transcriptome sequence dataset for gene discovery and molecular marker development in alfalfa. Methodology/Principal Findings More than 28 million sequencing reads (5.64 Gb of clean nucleotides) were generated by Illumina paired-end sequencing from 15 different alfalfa tissue samples. In total, 40,433 unigenes with an average length of 803 bp were obtained by de novo assembly. Based on a sequence similarity search of known proteins, a total of 36,684 (90.73%) unigenes were annotated. In addition, 1,649 potential EST-SSRs were identified as potential molecular markers from unigenes with lengths exceeding 1 kb. A total of 100 pairs of PCR primers were randomly selected to validate the assembly quality and develop EST-SSR markers from genomic DNA. Of these primer pairs, 82 were able to amplify sequences in initial screening tests, and 27 primer pairs successfully amplified DNA fragments and detected significant amounts of polymorphism among 10 alfalfa accessions. Conclusions/Significance The present study provided global sequence data for autotetraploid alfalfa and demonstrates the Illumina platform is a fast and effective approach to EST-SSR markers development in alfalfa. The use of these transcriptome datasets will serve as a valuable public information platform to accelerate studies of the alfalfa genome.


Introduction
Alfalfa (Medicago sativa L. subsp. sativa), a member of the Fabaceae family, is the most widely cultivated forage legume in the world and is the third most valuable crop in the United States, after corn (Zea mays L.) and soybean (Glycine max (L.) Merr.) [1]. Alfalfa is referred to as "the queen of forage crops" because it is highly productive and stress tolerant and provides a valuable forage crop for livestock. In addition, alfalfa has considerable potential for ethanol production. The stems can be separated from the leaves and used as a biofuel crop, while the leaves can be used as a protein supplement for animals [2]. To improve its rumen digestibility, the value of alfalfa could be enhanced by increasing the cellulose content and decreasing the lignin content in stem cell walls.
Cultivated alfalfa is a perennial, obligate outcrossing, autotetraploid (2n = 4x = 32) plant with a genome size of 800-900 Mb [3]. Synthetic populations of alfalfa are extremely polymorphic due to their high degree of outcrossing. These features complicate genetic and genomic studies of alfalfa and have led to the development of M. truncatula as a model legume. M. truncatula is a diploid, self-fertile species with a short life cycle and a small genome (500 Mb) [4]. Several framework genetic linkage maps have been constructed for diploid and tetraploid alfalfa using Restriction Fragment Length Polymorphism (RFLP), Simple Sequence Repeat (SSR), Amplified Fragment Length Polymorphism (AFLP), and Singledose Restriction Fragments (SDRFs) markers [5,6,7,8,9,10,11,12,13,14], and Quantity Trait Loci (QTLs) associated with agronomic traits within relatively large genomic regions have been identified [11,12,15,16,17,18,19,20,21,22,23]. Several studies attempted to transfer EST-SSRs from M. truncatula to M. sativa. However, only a small number of the EST-SSR markers were polymorphic and could be incorporated into the autotetraploid map of M. sativa [11,23].
RNA-Seq, a massively parallel sequencing method for transcriptome analysis, only analyzes transcribed portions of the genome, avoiding the non-coding and repetitive sequences that comprise much of the genome. Recently, RNA-Seq has expanded the identification of alfalfa genes. Using Illumina sequencing, 124,025 unique sequences from MSGI 1.0 have been identified from the elongating stem and post-elongation stem internodes of two alfalfa genotypes with divergent cellulose and lignin concentrations [24]. Using 454 sequencing, 54,216 unique sequences were obtained from the roots and shoots of two alfalfa genotypes with different water stress sensitivities [25]. Illumina sequencing of old and young stems of 27 alfalfa genotypes led to the identification of 25,183 contigs [26]. Additionally, Illumina sequencing was performed in alfalfa roots contrasting in salt tolerance, and 60,290 tentative consensus sequences were obtained [27]. While these experiments have identified numerous transcripts, the transcripts were derived from limited tissues, such as stems, roots, and shoots. These transcriptomes do not encompass the complete set of transcripts in other organs or tissue systems during the different developmental periods of alfalfa. Furthermore, the identification of leaky transcripts may be essential to elucidate the functional elements of alfalfa and reveal the molecular characteristic of cells and tissues. Therefore, sequencing the transcriptome of a broader array of tissues will permit the global identification of transcripts that may be useful in modern alfalfa breeding programs.
In the present study, we performed de novo transcriptome sequencing of M. sativa L. subsp. sativa using the Illumina GA IIx sequencing platform. A total of 15 tissue types were included to identify the global and complete transcriptome of this species. The transcriptome coverage was 5.64 Gb of clean nucleotides, which was sufficiently comprehensive to identify most known genes and major metabolic pathways. A total of 40,433 unigenes were identified, and 1,649 SSRs were determined. This dataset is a public information platform for gene expression analysis, genomics, and functional genomics in M. sativa L. subsp. sativa.

Tissue Material and RNA Isolation
The alfalfa cultivar "Golden queen" was grown in a greenhouse under a 16 h light/8 h dark cycle at 22°C at Lanzhou University, Lanzhou, China. A total of 15 tissue types were collected, including germinated seeds (36 hours after seed germination), germinated seeds (48 hours after seed germination), cotyledons (from a 7-day-old seedling), unifoliate leaves (from a 20-day-old seedling), roots (from a 20-day-old seedling), compound leaves, young stems (less lignified), middle stems (moderately lignified), old stems (highly lignified), shoot apex, young inflorescences (diameter 0.4-0.5 cm), mature inflorescences (diameter 2 cm), young pods (16 days after pollination), and mature pods (24 days after pollination) ( Figure 1). Callus cells were induced from alfalfa seeds on MS solid medium containing 2,4-D (2.0 mg/L) and 6-BA (0.5 mg/L) under a 16 h light/8 h dark cycle at 25°C for approximately 30 days. The callus cells were actively dividing and consisted of a mass of undifferentiated cells. All sampled tissues were immediately frozen in liquid nitrogen and stored at -80°C until RNA extraction.
For Illumina sequencing, the total RNA from each sample was isolated using the RNeasy Plant Mini Kit (Qiagen, Cat. # 74904). Both the quantity and quality of the RNA were verified using NanoDrop ND1000 (Thermo Scientific) and Agilent 2100 Bioanalyzer (Agilent). The total RNA from all 15 samples was adjusted to the same concentration. To cover more tissuespecific transcripts in alfalfa, a total of 20 µg of RNA was pooled equally from the 15 tissues for cDNA library preparation.

cDNA Library Construction and Sequencing
The cDNA library was constructed following the manufacturer's instructions for the mRNA-Seq Sample Preparation Kit (Cat # RS-930-1001, Illumina Inc., San Diego, CA). Briefly, the total RNA was collected and pooled from all 15 alfalfa tissues, and magnetic oligo(dT) beads were used to isolate the poly(A) mRNA. An RNA Fragmentation Kit (Ambion, Austin, TX) was used to cleave the mRNA into short fragments, which were then used as templates to reverse-transcribe firststrand cDNA using random hexamer primers and reverse transcriptase (Invitrogen, Carlsbad, CA). Second-strand cDNA was synthesized in a reaction containing buffer, dNTPs, RNase H (Invitrogen) and DNA polymerase I (New England BioLabs, Ipswich, MA). A paired-end library was synthesized using the Genomic Sample Preparation Kit (Illumina) according to the manufacturer's instructions. Short fragments were purified with the MinElute PCR Purification Kit (QIAGEN, Dusseldorf, Germany) and eluted in 10 µL of EB buffer (QIAGEN). The short fragments were connected with sequencing adapters, and the desired fragments (200 ± 25 bp) were separated by agarose gel electrophoresis and purified with a gel extraction kit. Finally, the sequencing library was constructed by PCR amplification (15 cycles) and sequenced using the Illumina Genome Analyzer IIx sequencing platform. Data analyses and base calling were performed with the Illumina instrument software. The transcriptome datasets are available in the NCBI Sequence Read Archive (SRA) under accession number SRX247927.

Sequence Analyses, Assembly, and Annotation
All raw reads were cleaned by removing adapter sequences, low-quality sequences with ambiguous bases 'N', and reads with more than 10% Q < 20 bases. De novo transcriptome assembly of these quality reads was performed using the Trinity software, which recovers more full-length transcripts across a broad range of expression levels with a sensitivity similar to methods that rely on genome alignments [28]. The overlap settings used for the Trinity assembly were 31 bp with 80% similarity, and all other parameters were set to default values.
For further analysis, we first used BLASTX (E-value <10E-5) to compare the assembled sequences against the NCBI Nr, Nt, and Swiss-Prot databases. Gene names were assigned to each assembled sequence based on the highest scoring BLAST hit. To annotate the assembled sequences with Gene Ontology (GO) terms describing biological processes, molecular functions and cellular components, we used the BLAST2GO program to obtain the GO annotations of the unigenes [29]. The annotated unigenes were enriched and refined using ANNEX [30].
The unigene sequences were also aligned to the Clusters of Orthologous Group (COG) database. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were assigned to the assembled sequences using online software (http:// www.genome.jp/kegg/kegg4.html) to predict and classify functions. We used the bi-directional best hit (BBH) method to obtain KEGG Orthology (KO) assignments [31].
We also performed a homologous comparison of the genomes of alfalfa and M. truncatula. The genome sequence data for M. truncatula were obtained from the Mt3.5.2 release (http://tofu.cfans.umn.edu/). Based on nucleotide sequence similarity, a BLAST search was performed with a cutoff E-value <10E-10 using blast-2.2.23-ia32-win32.

EST-SSRs Amplification and Diversity Analysis
A total of 10 alfalfa accessions including Chinese landraces, cultivars, and foreign collections (Table S1) were selected for polymorphism investigation with the EST-SSRs. Total genomic DNA was extracted from young leaves of five individual plants in each accession via the cetyltrimethylammonium bromide (CTAB) method [32]. PCR amplifications were conducted in a final volume of 10 µL containing 40 ng template DNA, 1× PCR buffer, 2.0 mM MgCl 2 , 2.5 mM dNTPs, 4 µM each primer, and 0.8 U Taq polymerase (TaKaRa). The PCR reaction cycling included 4 min at 94°C, 35 cycles of 30 s at 94°C, 35 s at the annealing temperature (Table S2), and 1 min at 72°C, with a final extension step of 5 min at 72°C. The PCR products were subjected to electrophoresis on 8.0% non-denaturing polyacrylamide gels and stained by ethidium bromide [33]. The band sizes were determined by comparison with the DL500 DNA maker (TaKaRa). The observed heterozygosity (Ho) was calculated as previous described [34,35], and the corrected heterozygosity (He, corrected for sample size) and Shannon-Wiener diversity index (H'c, corrected for sample size) were analyzed by the ATETRA 1.2.a software program [36] . Only specific bands that could be unambiguously scored across all individual plants were used in this study. The resulting data matrix was analyzed using POPGENE v. 1.31 [37].

Illumina Sequencing and de novo Assembly
To globally and comprehensively cover the alfalfa transcriptome, total RNA was extracted from 15 different alfalfa tissues. The morphology of each tissue is shown in Figure 1. Equal amounts of total RNA from each alfalfa tissue were pooled together. The cDNA was subjected to Illumina Genome Analyzer sequencing. After stringent quality assessment and data filtering, 28,790,610 reads (5.64 Gb) with 94.48% Q20 bases were selected as high-quality reads for further analysis. An overview of the sequencing results is presented in Table 1.
The high-quality reads were deposited into the NCBI SRA database (SRX247927).
All high-quality reads were assembled de novo using the Trinity program [28], which produced 81,277 scaffolds with an N50 length of 1,323 bp and a mean length of 873 bp. The distribution of the scaffolds is shown in Table 2. A total of 40,433 unigenes were obtained with an N50 length of 1,300 bp and a mean length of 803 bp. The length distributions of the unigenes are shown in Table 2. As expected for a randomly fragmented transcriptome, there was a positive relationship between the length of a given unigene and the number of reads assembled into it ( Figure 2).
To further assess the extent of transcript coverage provided by the unigenes, we plotted the ratio of assembled unigene length against M. truncatula ortholog length ( Figure 3). Based on the similarity of coding sequences, a BLAST search was performed with a cutoff E-value <10E-10 between M. truncatula and M. sativa. Among the 64,127 (Mt3.5.2) transcripts, 41,447 (64.63%) M. truncatula transcripts had homologous transcripts in the M. sativa genome. This finding suggests that most of the M. truncatula ortholog coding sequences were covered by at least one individual unigene. The plotting results revealed that 14,933 M. truncatula orthologs (36.03%) were covered by unigenes with a percentage less than 100%, and 26,514 orthologs (63.93%) were covered by unigenes with a percentage greater than 100% (Figure 3). This finding indicates that additional sequencing and more advanced assembly technology are needed for more comprehensive and precise transcriptome data in alfalfa. In addition, the available M. truncatula genome sequence was used as a scaffold to align the alfalfa unigene sequences. Under stringent conditions using Blat, including a threshold of 95% identity and 90% coverage (Figure 4), 27,853 (68.89%) unigenes were mapped to the Mt3.5.2 genome sequence assembly, and their likely map positions were inferred. Among the nine chromosomes in M. truncatula, chromosome 5 (4,348) and 6 (1,491) showed the highest and the lowest number of unigenes on the Mt3.5.2 chromosomes, respectively (Table 3).

Functional Annotation
Unigene annotations provide functional information, including protein sequence similarities, GO terms, COG clusters, and KEGG pathway information. The unigenes were annotated by alignment with the following databases: the National Center for Biotechnology Information (NCBI) nonredundant protein (Nr) database, the NCBI non-redundant nucleotide sequence (Nt) database, Swiss-Prot, KEGG, COG, the InterPro (Ipr) database, and TrEMBL. The best alignment was selected from the matches with E-values less than 10E-5. The overall functional annotations are depicted in Table 4. Altogether, 36,684 (90.73%) unigenes were successfully annotated in the Nr, Nt, Swiss-Prot, KEGG, COG, Ipr, and TrEMBL databases, suggesting that they have relatively well-conserved functions  (Table 4). According to the RPKM, 5,000 unigenes were chosen as highly expressed unigenes, of which 4,941 unigenes were annotated by the NCBI Nr protein database (Table S3). Based on 29,026 Nr annotations, 14,415 (35.65%) unigenes were assigned to one or more GO terms. The GO-annotated unigenes belonged to the biological process, cellular component, and molecular function clusters and were distributed across 35 categories, with 34.4% in biological processes, 50.7% in molecular functions, and 14.9% in cellular components ( Figure 5). In the biological processes category, metabolic processes (GO:0008152) was the dominant group, followed by cellular processes (GO:0009987), biological regulation (GO:0065007). With regard to molecular functions, the majority of unigenes were assigned to binding (GO: 0005488), followed by catalytic activity (GO:0003824) and transporter activity (GO:0005215). For the cellular components category, cell (GO:0005623) and cell part (GO:0044464) were highly represented ( Figure 5).
In addition, all 40,433 unigenes were subjected to a search against the COG database. Of 29,026 unigenes with significant similarity to nr proteins in this study, 8,522 (21.08%) were assigned to COG classifications ( Figure 6). Among the 25 COG categories, the cluster for general function prediction represented the largest group, followed by replication,  Figure 6). To further analyze the alfalfa transcriptome, all 40,433 unigenes were analyzed with KEGG pathway tools. This process predicted a total of 293 pathways that were represented by 5,723 (14.15%) unigenes (Table S4).

SSR Discovery and Polymorphic Analysis
To further assess the assembly quality and to develop new molecular markers, all 40,433 alfalfa unigenes that were identified in the present study were used to mine potential microsatellites, which are defined as di-to hexa-nucleotide SSRs with a minimum of five repeats for all motifs. First, 10,901 unigenes with lengths greater than 1 kb were screened. Using the SSRIT tool, a total of 1,649 potential EST-SSRs were identified from 1,494 unigenes, of which 185 sequences contained more than one EST-SSR (Table 5). To identify the putative functions of the genes containing the SSR loci, 1,494 unigenes were searched against the Nt database with an Evalue less than 10E-5. Of these queries, 1,485 unigenes had BLAST hits to known proteins in this database (Table S5).
Based on the SSR-containing sequences, 100 pairs of EST-SSR primers were designed using BatchPrimer3. Detailed information about the designed primers is shown in Table S2.  Of the 100 primer pairs, 82 were able to amplify PCR products from alfalfa genomic DNA, while 18 primer pairs failed to amplify PCR products at various annealing temperatures and Mg 2+ concentrations. Of the 82 successful primer pairs, 37 PCR products were of the expected sizes, and 34 primer pairs generated PCR products that were larger than expected. The PCR products of the other 11 primer pairs were smaller than expected. Of the 37 primer pairs that amplified PCR products of the expected sizes, 27 PCR amplifications resulted in more than one band, which might be due to the high heterozygosity of the autotetraploid alfalfa germplasm. These 27 primer pairs were used to evaluate polymorphisms across 10 alfalfa accessions (Table 7 and Figure 7), with five individual plants in each accession. All 27 EST-SSR loci showed different allelic polymorphisms. The number of alleles per locus varied from three to 11 (mean: 6.11). The mean values of Ho, He, and H'c was 0.64, 0.64, and 1.23 ( Table 7). The maximum genetic distance value was 0.1517 between the Longdong I and AmeriStand 403T accessions and the minimum value of 0.0271 between the Longdong III and Xinjiangdaye accessions (Table S6).

Illumina Paired-end Sequencing
Transcriptome sequencing is one of the most effective tools for identifying transcript sequences, which is essential for identifying novel genes and developing molecular markers. Next-generation sequencing (NGS) technologies have provided opportunities for high-throughput gene discovery on a genomewide scale with or without a complete genome sequence. Because of its high efficiency, speed, accuracy, and low cost, NGS has been widely used to analyze transcriptome sequencing and assembly in many non-model plants such as pine [38], ginseng [39], Artemisia annua [40], common vetch [41], bamboo [42], sweet potato [43], sesame [35], rubber [44], safflower [45], and alfalfa [24].
Prior to the present study, all alfalfa sequencing efforts using RNA-Seq and 454 sequencing were based on a limited number of tissues, such as stems, roots, and shoots [24,25,26]. In the previous studies, some homologous transcripts specifically expressed in other tissues might be leaky. Here, we used the Illumina HiSeqTM 2000 platform to profile the alfalfa transcriptome from 15 different tissues, obtaining 5.64 Gb clean reads and identifying 40,433 unigenes from a de novo assembly. Of the 40,433 unigenes, 36,684 (90.73%) were successfully annotated, indicating their conserved functions. This procedure produced unigenes with an average length of 803 bp; the alfalfa unigenes produced in previous studies had   Table 5. Summary of EST-SSR types in the alfalfa transcriptome.

Repeat motif Number a Percentage b
Di-nucleotide  average lengths of 541 bp [25] and 384 bp for unique sequences [24] in alfalfa.
Altogether, 90.73% of the unigenes were annotated in the Nr, Nt, Swiss-Prot, KEGG, COG, Ipr, and TrEMBL databases. The low percentage (9.3%; 3,749 of 40,433) of unmapped unigenes that were assigned a putative function might be mainly due to the relatively short sequences of the resulting unigenes, most of which likely lack conserved functional domains [46]. Alternatively, these sequences might contain a known protein domain but not display sequence matches due to the short query sequence, resulting in false-negative results. Specifically, 20.1% of unigenes shorter than 300 bp, 8.1% of unigenes between 300 bp and 1,000 bp and 0.1% of unigenes longer than 1,000 bp had no BLAST matches. Another potential reason for a lack of matches is that some of these short unigenes represent non-coding RNAs. The insufficient number of ESTs and limited genomic information for alfalfa in public databases also influences the annotation efficiency. Furthermore, a minority of unigenes without BLAST hits might function as autotetraploid alfalfa-specific genes. The alfalfa unigenes were assigned to a wide range of GO categories and COG classifications, indicating that our pairedend sequencing data represented a wide diversity of transcripts (Figures 5 and 6). Most representative unigenes were mapped by the KEGG database to specific pathways, such as plant hormone signal transduction, amino acid metabolism, lipid metabolism, energy metabolism, plant-pathogen interactions, photosynthesis, and transcription factors (Table S4). Among the GO categories, cell, binding activity, and metabolic process were the most abundant classes in the cellular component, molecular function, and biological process categories, respectively; these results are consistent with studies in sweet potato [43], rubber [44], and sesame [35].
We generated a large transcriptome that can be used for novel gene discovery or for the investigation of alfalfa molecular mechanisms. Our results confirm that highthroughput Illumina paired-end sequencing is an efficient, inexpensive, and reliable method for transcriptome analysis in non-model plants, including autotetraploid plants.

SSR Markers Identification and Characterization
SSRs are powerful molecular markers that are locus-specific, co-dominant, multi-allelic, highly polymorphic, and transferable among species within genera [47]. EST-SSR markers are very important for studies involving genetic diversity, cultivar identification, evolution, linkage mapping, QTL mapping, comparative genomics, and marker-assisted selection breeding [26]. Most alfalfa SSR markers are derived from M. truncatula, with the exception of 61 polymorphic genomic SSRs [48] and 29 polymorphic EST-SSRs [49] developed from alfalfa. Therefore, the development of novel SSR markers from alfalfa could be more useful for genetic studies and breeding applications. The transcriptome sequencing provided many of sequences for developing numerous EST-SSRs. Although we used stringent selection criteria 1,649 potential EST-SSRs were identified in 1,494 unigenes. Tri-nucleotide repeats were the most abundant motif type, followed by di-, tetra-, penta-, and hexa-nucleotide repeats, which is consistent with previous reports [49].
Of 100 PCR primer pairs that were randomly selected for validation, 82 amplified clear bands. Our PCR success rate was higher than in a previous study [49], but the success rate was a little lower than those reported for sweet potato [43], rubber [44], and sesame [35]. The failure of 18 primer pairs to produce amplicons may have been due to the amplification with genomic DNA, the location of the primer across splice sites, large introns, chimeric primer, or poor-quality sequences [35]. The deviation of 45 amplicons from expected sizes may have been caused by the presence of introns, a lack of specificity, or assembly errors [44]. The 10 PCR products presented only one band, which might result from the primer design or the homozygosity of the loci in alfalfa germplasm.
Using 27 polymorphic EST-SSR markers, the He values ranged from 0.17 to 0.84, and most of the markers (88.9 %) had He values greater than 0.5. This result indicated a high level of polymorphism in alfalfa, as suggested previously [34,49]. This result may be due to a very high cross-pollination and autotetraploidy in this species. The 27 polymorphic EST-SSR markers are important for alfalfa research involving genetic diversity, relatedness, evolution, linkage mapping, cultivar protection, comparative genomics, and gene-based association studies. Next generation transcriptome sequencing provides plenty of sequences for molecular marker development. The 1,649 SSRs identified from our data will produce a wealth of markers for further genetic study. Based on these identified SSR-containing sequences, we will design more PCR primers and evaluate their polymorphism among alfalfa accessions with more individual plants in each accession, and provide a more valuable resource of genetic markers for future research in alfalfa.

Conclusions
This work presents a de novo transcriptome sequencing analysis of mixed RNAs from 15 different tissues. A total of 5.64 Gb of data were generated and assembled into 40,433 unigenes. Based on these sequences, EST-SSRs were predicted and analyzed. The 1,649 potential EST-SSRs predicted in this study provide a solid foundation for molecular marker development in alfalfa. A total of 27 polymorphic primer pairs successfully amplified fragments, revealing abundant polymorphisms between 10 alfalfa accessions. To our knowledge, this is the first study to use Illumina paired-end sequencing technology to investigate the global transcriptome of autotetraploid alfalfa and to assemble the reads without a reference genome. This valuable resource will provide a foundation for future genetic and genomic studies of autotetraploid alfalfa species.