De Novo Assembly of Transcriptome Sequencing in Caragana korshinskii Kom. and Characterization of EST-SSR Markers

Caragana korshinskii Kom. is widely distributed in various habitats, including gravel desert, clay desert, fixed and semi-fixed sand, and saline land in the Asian and African deserts. To date, no previous genomic information or EST-SSR marker has been reported in Caragana Fabr. genus. In this study, more than two billion bases of high-quality sequence of C. korshinskii were generated by using illumina sequencing technology and demonstrated the de novo assembly and annotation of genes without prior genome information. These reads were assembled into 86,265 unigenes (mean length = 709 bp). The similarity search indicated that 33,955 and 21,978 unigenes showed significant similarities to known proteins from NCBI non-redundant and Swissprot protein databases, respectively. Among these annotated unigenes, 26,232 a unigenes were separately assigned to Gene Ontology (GO) database. When 22,756 unigenes searched against the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database, 5,598 unigenes were assigned to 5 main categories including 32 KEGG pathways. Among the main KEGG categories, metabolism was the biggest category (2,862, 43.7%), suggesting the active metabolic processes in the desert tree. In addition, a total of 19,150 EST-SSRs were identified from 15,484 unigenes, and the characterizations of EST-SSRs were further compared with other four species in Fabraceae. 126 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Among the 9 germplasms in Caranaga Fabr. genus, PCR success rate were 93.7% and the phylogenic tree was constructed based on the genotypic data. This research generated a substantial fraction of transcriptome sequences, which were very useful resources for gene annotation and discovery, molecular markers development, genome assembly and annotation. The EST-SSR markers identified and developed in this study will facilitate marker-assisted selection breeding.


Introduction
Caragana is a genus comprises about 100 species in the family Fabraceae, and distributes in Asia and Eastern Europe. Most of the Caragana species are shrubs or small trees and with the character of high tolerance to several abiotic stresses including drought, salt, and cold [1]. However, compared with its high ecological and economic values, the genome and genetic essence remain largely unknown because of little genomic information. Caragana korshinskii, which is widely distributes in sandy grassland in northwestern China and Mongolia [2], is a useful model organism for studying salt and drought resistance mechanisms in Caragana Fabr. because it can tolerate severe drought stress with 7.38% of soil water content- [3]. Much of the work conducted on this species to date has focused on the physiological mechanisms responsible for its resistance to abiotic factors [4,5]. More recent studies using different types of molecular markers, such as RAPD [6], AFLP [1,7] have provided useful information on its genetics and evolutionary history. However, partly because of the scarcity of suitable molecular markers, much remains to be learned about the genetic factors responsible for the ability of C. korshinskii to cope with various adverse environmental conditions. Molecular markers play important roles in many aspects of plant breeding, such as genetic diversity research [8], marker-assisted selection [9], and identification of genes that are responsible for desirable traits [10]. SSR is one of the most often used molecular markers and widely used in different aspects of agronomic research [11,12]. Traditional SSR marker development needs partial genomic DNA library construction, cloning and labour-intensive Sanger sequencing [13,14]. With the application of next-generation sequencing (NGS) technology, it has become possible to develop large numbers of SSR markers for non-model organisms quickly and cost-efficiently [15,16]. The transcriptome profile provides information on gene expression and regulation. Therefore, transcriptome analysis is essential to interpret the functional elements of the genome and reveal the molecular components of cells and tissues [17,18]. Transcriptome sequencing is an efficient way to generate functional genomic-level data for non-model organisms. Large collections of EST sequences are very important for gene annotation and discovery [19], comparative genomics [20], development of molecular markers [16], and population genomics studies of genetic variation associated with adaptive traits [21]. Until now, transcriptomic sequencing for SSR mining has been used in a wide range of angiosperm species, such as rubber tree [22], castor bean [23], sesame [24]. Furthermore, transcriptome-derived SSR markers have been found close to or within the functional genes [25,26]. And it was found the characters of di-, tri-, tetra-, penta-and hexa-nucleotide SSRs varying in different taxa. For example, tri-nucleotide repeats have generally been observed to have the highest frequency in many crops, including cotton, barley, wheat, maize, sorghum, rice and peanut [27][28][29]. While, in sesame and some Rosaceae species, the most abundant repeat motif type was the di-nucleotide type [24,30]. Transcriptomic information, however, is extremely lacking for the species of Caragana Fabr. Until now, there has been little interest in such data.
Companion with the NGS technology developing, an excellent opportunity exists to explore the issues related to SSR markers from the transcriptome of C. korshinskii. In this study, we first obtained the transcriptome of C. korshinskii by Illumina sequencing to validate and characterize microsatellite markers. Based on these databases, thousands of SSR loci were used to design SSR primers. A sample of these primers was further developed to estimate genetic diversity of nine representative species in Caragana Fabr.genus.

Plant materials collection and preparation
Caragana korshinskii seeds were provided by the Gansu Desert Control Institute. The seeds were sown on damp filter paper and incubated at 4°C for 4 days before being placed at 23°C under long-day (16 h light/8 h dark) conditions with a photosynthetic photon flux density of 150 μmol m -2 s -1 . After growth for one month, the different tissues from seedlings, including leaves, stems and roots, were harvested for RNA isolation.

RNA isolation and transcriptome sequencing
The total RNA of plants was extracted with TRIzol Reagent (Invitrogen, 15596-026) according to the manufacturer's instructions. The RNA samples that met the requirements were used to construct transcriptome sequence libraries. The total RNA of each sample was then pooled at equivalent quantities. Sequencing libraries were generated using a NEBNext Ultra RNA Library Prep Kit for Illumina (NEB, USA) following the manufacturer's recommendations. Following the manufacturer's procedures, mRNA was purified from the pooled total RNA using polyT oligo-attached magnetic beads. A fragmentation buffer was added to disrupt the mRNA into short fragments. Reverse transcriptase and random primers were used to synthesise the first-strand cDNA from the cleaved mRNA fragments. The second-strand cDNA was synthesised using buffer, dNTPs, RNaseH, and DNA polymeraseI. The double-strand cDNA was purified using the QIAquick PCR extraction kit (QIAGEN, Hilden, Germany) and washed with EB buffer for end repair and single nucleotide A (adenine) addition. Finally, sequencing adaptors were ligated onto the fragments. The required fragments were purified by AMPure XP beads and enriched by PCR to construct a library for transcriptome sequencing.

Data filtering and de novo assembly
The transcriptome library was sequenced using the Illumina HiSeq 2000 system. The sequencing-received raw image data were transformed by base calling into the sequence data, which were termed raw reads. The raw data were then filtered by data-processing steps to generate clean data via a process that included the removal of adapter sequences, reads in which unknown bases are greater than 10%, and low-quality sequences (the percentage of lowquality bases of quality value 5 is greater than 50% in a read). All of the raw data was submitted to the database with the code Bioproject: SUB718537 and BioSample: SAMN03121496. After obtaining the clean data, transcriptome assembly was accomplished by using Trinity software [31] with min_kmer_cov set to 2 by default and all other parameters set at default values.

Functional annotation of unigenes
For functional annotation, the assembled unigenes that might putatively encode proteins were searched against NR (http://www.ncbi.nlm.nih.gov/), Swiss-Prot (http://www.expasy.ch/sprot/), KEGG (http://www.genome.jp/kegg/) using the BLASTX algorithm. A typical cut-off value of E<1e-5 was used. With the NR annotation, the Blast2GO program [32] was used to obtain the GO annotation of unigenes according to component function, biological process and cellular component ontologies. After obtaining a GO annotation for every unigene, WEGO software [33] was used to perform GO functional classification for all unigenes and to understand the distribution of gene functions of the species at the macro level.

SSR mining and primer design
The RNA-seq data from the other four Fabrceae species, Cicer arietium, Lotus corniculatus, Medicago sativa, and Medicago truncatula were got from the database PlantGDB(http://www. plantgdb.org/). Then, the MISA software (http://pgrc.ipk-gatersleben.de/misa/misa.html) was used to identify microsatellites in the unigenes got in this study and the four above database. The standard of EST-SSRs was assumed to contain motifs of one to six nucleotides in size. Definement of microsatellites was used with the following settings, SSR repeat motifs and number of repeats shown respectively, mono-10, dimer-6, trimer-5, tetramer-5, pentamer-5, hexamer-5. The primer for each SSR was designed using Primer3 (http://primer3.sourceforge.net/releases. php).

PCR amplification and validation of selected EST-SSR markers
All primer pairs were screened for amplification and polymorphisms using DNA from 9 Caragana species including C.opulens Kom., C.microphylla Lam., C.intermedia Kuang et H.C.Fu, C.arborescens Lam., C.rosea Turcz.ex Maxim, C.roborovskyi Kom., C.stenophylla Pojark., C.acanthophylla Kom. C.korshinskii Kom. In total, 126 pairs of primers were designed (S3 Table) and validated by PCR. The DNA for PCR amplification was extracted from the control samples using the CTAB method [34]. PCR amplification was carried out as follows: 94°C for 4 min, followed by 35-40 cycles of 94°C for 30 s, 55-60°C for 30 s and 72°C for 30 s. The final extension was performed at 72°C for 10 min. The PCR products were analysed by electrophoresis on 1.0% agarose gels. Coefficients of genetic similarity for the 9 species used in this study were calculated using the SIMQUAL program of NTSYS-pc Version 2.10 [35]. A neighbor-joining dendrogram was constructed based on the genetic similarity matrix with the SHAN clustering program of NTSYS-pc using the UPGMA algorithm.

Illumina paired-end sequencing and de novo assembly
To elucidate the transcriptome of C. korshinskii, RNA was extracted from different tissues and sequenced with Illumina paired-end sequencing technology. In this study, a total of 66,351,948 raw sequencing reads with a length of 100 bp were generated from a 200 bp insert library. After removing adaptors and low-quality data, 64,031,599 clean reads were obtained. Then, the highquality reads were used to assemble the transcriptome data with Trinity software. According to the overlapping information of high-quality reads, 202,163 transcripts were generated with an average length of 1,089 bp and an N50 of 1,772 bp. After extracting the longest transcript for each transcript, 86,265 unigenes were obtained. The average length was 709 bp, and the length greater than 500 bp accounted for approximately 37.27% (Table 1, S1 Fig.).

Annotation of all nonredundant unigenes
For the validation and annotation of the assembled unigenes, the assembled unigenes were searched against the NCBI non-redundant (NR) and SwissProt protein databases using the BLAST2 program with an E-value threshold of 1e-5. Among 86,265 unigenes, 33,955 (39.36%) had significant similarity to 21,118 unique proteins. Of all of the unigenes, 21,978 (25.47%) with significant identities to SwissProt proteins were matched with 21,978 unique protein accessions (Table 2). A lower percentage was obtained when searching against the SwissProt protein database. In total, BLAST searches identified 16,533 unique protein accessions from the NR and SwissProt protein databases, suggesting that this Illumina paired-end sequencing project generated a substantial fraction of the expressing genes in this study.

Functional classification by GO analysis
Gene ontology (GO), is an internationally standardised gene functional classification system. In order to classify the functions of the predicted C. korshinskii unigenes, GO analysis was performed. In total, 26,232 unigenes with BLAST matches to known proteins were assigned to GO classes with 9787 functional terms (Table 2, S1 Table). As shown in Fig. 1, assignments to the biological process constituted the majority (67,062, 46.17%), followed by cellular component (45,175,31.1%) and molecular function (33,021, 22.73%).
Under the category of biological process, cellular process (15,743, 23.48%) and metabolic process (14995, 22.36%) were prominent, indicating that important cell processes and metabolic activities occurred in C. korshinskii. Under the classification of molecular function, binding (15,374,46.6%) and catalytic activity (12,420, 37.6%) were the first and second largest categories, respectively, whereas other categories, such as transporter activity, structural molecule activity, nucleic acid binding transcription factor activity, and molecular transducer activity, contained 4017 unigenes, representing only 12.17%. Regarding the cellular components, two categoriescell and cell part-represented approximately 38.75% of cellular components, organelle accounted for approximately 13.84%, and membrane and membrane part accounted for 19.21%.

Functional classification by the KEGG pathway
To further analyse the transcriptome of C. korshinskii, all of the unigenes were analysed in the KEGG pathway database.The KEGG pathway database is a knowledge base for the systematic analysis of gene functions in terms of networks of genes and molecules in cells and their variants specific to particular organisms. Out of the 86,265 unigenes, 5598 (6.49%) with significant matches in the database were assigned to 5 main categories, including 32 KEGG pathways (Fig. 2, S2 Table). Among these 5 main categories, metabolism was the largest (2862, 43.7%), followed by genetic information (1485, 22.68%), organismal systems (1045, 15.96%), cellular processes (644, 9.83%) and environmental information processing (513, 7.83%). These results indicate that active metabolic processes were on-going. As shown in S2 Table, KEGG metabolism contained 12 categories, such as carbohydrate metabolism, nucleotide metabolism, the biosynthesis of other secondary metabolisms, amino acid metabolism, lipid metabolism, and energy metabolism, among others.

Motif comparison of EST-SSR markers among 4 Caragana Fabr. species
In this study, the 86,265 unigenes generated in this study were used to mine potential microsatellites that were defined as mono-to hexa-nucleotide motifs with a minimum of three repetitions. Using the MISA software, a total of 19,150 potential simple sequence repeats (SSR) were identified in 15,484 unigenes. Of the 15,484 unigenes, 12,575 and 2,909 unigenes contained one and more than one SSR, respectively ( Table 3). The number of potential EST-SSR per unigene varied from 1 to 8, with an average of 1.17.
To obtain a comprehensive perspective of motif distribution, we further compared our results with data from other species in Fabraceae. Besides Mono-nucleotide type, the di-and tri-type were the most two frequent types (Fig. 3). Comparing with other 4 species, the di-nucleotide was the most abundant type, while for the other 4 species, the tri-nucleotide was the most frequent type. The dominant di-nucleotide repeat motif in SSRs was AG/CT, whereas CG/GC was the least abundant (Table 4). Among the tri-nucleotide repeats, the most frequent repeat was AAG/CTT, followed by ACC/GTT (19.6%) and AAT/ATT. The most frequent of the motif was consistent with the other 4 species. While in Lotus corniculatus, ACC/GGT motif was the second abundant type, and in Medicago sativa and Medicago truncatula, the third abundant type was the ACC/GGT motif.

Validation of EST-SSR markers
Based on the SSR-containing sequences, 126 SSR sites were randomly selected to design EST-SSR primers with Primer Premier 3.0. The information of the EST-SSR primers is shown in S3 Table. Among the 126 primer pairs, 118 were successful in PCR amplification with genomic DNA, and the remaining eight pairs of primers failed to generate PCR products at various annealing temperatures. Of the 118 working primer pairs, 98 PCR products showed specific amplification, among which 90 PCR products generated expected sizes, whereas the other nine generated PCR products that were larger than expected, suggesting that the amplified regions likely contained introns. A total of 20 PCR products generated more than one band, which might result from the primer design or the high heterozygosity of the Caragana germplasm.
All polymorphic loci were used to analyse the diversity of 9 species. The observed number of alleles (A) ranged from 1 to 5, with an average of 2.12 alleles per locus. The genetic distance was calculated by the NTSYS software. It was showed that for the 9 species could be divided into two groups (Fig. 4), the C.rosea Turcz.ex.Maxim was the far distance with the other species. The other eight species could be divided into two groups. Among the groups, the C.Korshinskii was closet to C.microphylla, C.intermedia Kuang et H.C.Hu and C.arborescens Lam.

Discussion
In this study, a large number of C.Korshinskii transcriptomic unigenes (86,265) were sequenced with the Illumina HiSeq 2000 platform ( Table 1). The N50 length of the unigenes was 1231 bp, and the average length was 709bp.These results are comparable to recently published plant transcriptomic analyses, such as that of Gossypium aridum (N50 = 593 bp) [36] and Momordica cochinchinensis(N50 = 450bp) [37]. Trinity is one of most powerful packages in the de novo assembly of short reads. In this study, fewer than half of the unigenes (41,493, 48.09%) were successfully annotated by BLAST against the public databases Nr, Nt, Swiss-Prot, GO and KEGG, given the absence of genomic information of C. Korshinskii (Table 2). Notably, the percentage of annotation is relative low among the previous studies using the same sequencing strategy during the last year (55 to 78.9% [38][39][40]). One possible reason of this lack of  annotation is technical limitations, such as sequencing depth and read length [41], which were common in all of the studies using de novo transcriptome analysis. The unannotated sequences were, on average, much shorter than were the annotated unigenes (382 bp vs 1182 bp).
The EST-SSR marker is important for a variety of research, including the assessment of genetic diversity, the development of genetic maps, comparative genomics, marker-assisted selection, and other fields. Until now, there has been no report of EST-SSR identification in desert trees. The transcriptome sequencing provided many sequences for developing numerous EST-SSR markers in the C.Korshinskii tree. In total, 19,150 potential EST-SSRs were identified from 15,484 unigenes. In this study, in addition to the common di-, tri-and other nucleotide repeats that were included in the selection, the mono-nucleotide repeats included SSR, and its proportion was greater than that of the other types. If mono-nucleotide repeats were excluded, the frequency of di-nucleotide was higher than that of tri-nucleotide in C.Korshinskii, which is different with the other 4 species. In previous studies, some of the results showed that di-nucleotide was the most abundant type, such as sesame [24], oilpalm [42], while other results showed tri-nucleotide was the most abundant type, like barley [43], wheat [29]. The most abundant di-and trinucleotide motifs were AG/TC and AAG/TTC, respectively. These results are consistent with results for dicots, such as oak [44] and castor bean [23].
Of 126 primer pairs that were randomly selected for PCR validation, 118 (93.7%) produced clear bands. The PCR success rate was the same as in previous studies, such as Populus euphratica [45], and higher than the results from Triwitayakorn et al. (75%) [46], which mean that the identified EST-SSR makers have high cross-transferability in Caragana genus. The polymorphism frequency among the 9 species was 90.5%, and this ratio is much higher than that in crops. For example, in sesame, 276 (92.0%) EST-SSR primer pairs yielded PCR amplification products in 24 cultivars. Thirty two primer pairs (11.59%) exhibited polymorphisms. Moreover, 203 primer pairs (67.67%) yielded PCR amplicons in the wild accession and 167 (60.51%) were polymorphic between species [24]. In peanut, 26 (10.3%) EST-SSRs exhibited polymorphisms between 22 cultivated peanut accessions and 221 (88%) were polymorphic between 16 wild peanut species [28].
Our dendrogram, based on genetic similarity results, divided the 9 species into 2 clear groupings. Among the group, the C.Korshinskii was closet to C.microphylla, C.intermedia Kuang et H.C.Hu and C.arborescens Lam. This result was consistent with other previous researches [47]. While for the other group, the C.roborovskyi Kom. was classified into the group, this result was different with the classification described by Zhao (1993) [47,48]. He divided the Caragana Fabr. genus into three sub-groups, and C.roborovskyi Kom. was the only one species in one sub-group. While the classification was only based on morphology investigation, it needed more molecular evidence to support. So the result in our research extended the researches for Caragana Fabr. genus.

Conclusion
To sum up, our data provided a case study that the microsatellite markers developed from transcriptome of C. korshinskii Kom. can be used for population genetic studies in Caranaga Fabr. Likewise, these markers should be of great value in further research on population and conservation genetics of species in the genus. The use of transcriptome sequences from next-generation sequencing, being rapid and efficient in the development of microsatellite markers, is of great value, not only for analysing intraspecies genetic diversity, but for future research across the genus.

Author Contributions
Conceived and designed the experiments: YL XWP. Performed the experiments: YYW SSW. Analyzed the data: XJT JW. Contributed reagents/materials/analysis tools: XJT YYW. Wrote the paper: YL.