De Novo Transcriptome Assembly of Pummelo and Molecular Marker Development

Pummelo (Citrus grandis) is an important fruit crop worldwide because of its nutritional value. To accelerate the pummelo breeding program, it is essential to obtain extensive genetic information and develop relative molecular markers. Here, we obtained a 12-Gb transcriptome dataset of pummelo through a mixture of RNA from seven tissues using Illumina pair-end sequencing, assembled into 57,212 unigenes with an average length of 1010 bp. The annotation and classification results showed that a total of 39,584 unigenes had similar hits to the known proteins of four public databases, and 31,501 were classified into 55 Gene Ontology (GO) functional sub-categories. The search for putative molecular markers among 57,212 unigenes identified 10,276 simple sequence repeats (SSRs) and 64,720 single nucleotide polymorphisms (SNPs). High-quality primers of 1174 SSR loci were designed, of which 88.16% were localized to nine chromosomes of sweet orange. Of 100 SSR primers that were randomly selected for testing, 87 successfully amplified clear banding patterns. Of these primers, 29 with a mean PIC (polymorphic information content) value of 0.52 were effectively applied for phylogenetic analysis. Of the 20 SNP primers, 14 primers, including 54 potential SNPs, yielded target amplifications, and 46 loci were verified via Sanger sequencing. This new dataset will be a valuable resource for molecular biology studies of pummelo and provides reliable information regarding SNP and SSR marker development, thus expediting the breeding program of pummelo.


Introduction
Pummelo (Citrus grandis (L.) Osbeck), also known as the female parent of sweet orange, is an important economical cultivated species and a valuable germplasm for citrus breeding. Pummelo is commercially cultivated in many countries, such as China, Thailand, Japan, Mexico and Israel, because pummelo fruits and juice have abundant antioxidant compounds, including vitamin C, carotenoids, flavonoids and limonoids, which can reduce the risk of oxidative-related diseases [1][2][3][4].
Although pummelo plays an important role in agricultural production and has beneficial health effects, little genomic and transcriptomic information is publicly available. As of January 2015, only 327 nucleotide and 58 EST sequences of pummelo were deposited in the GenBank database. Pummelo is an ancient species of the Citrus genus [5]; additionally, many accessions possess the mechanism of self-incompatibility [6]. Thus, it is difficult to sequence and assemble a whole genomic with a wide range of variability and high degree of heterozygosity. An alternative approach, namely transcriptome sequencing, which is used to generate massive expressed sequence tags (ESTs), was considered.
The transcriptome dataset is a valuable resource not only for expression profile construction but also for novel gene discovery [7][8][9] as well as for a large number of simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) markers [10][11][12]. SSR and SNP molecular markers are very useful for variety identification, population structure analysis, and linkage map development and will contribute to the acceleration of the breeding program. In a previous study, 343 AFLPs and 335 SSRs were used to assess the genetic diversity of 110 pummelo germplasms [13]; thereafter, 178 pummelo genotypes were identified using 25 SNPs [14]. However, most of the existing markers were developed from other species in Citrus, mainly from C. sinensis, and few markers have been developed from pummelo.
Given that the number of the markers derived from pummelo is limited, Biswas et al. [15] first reported a set of pummelo EST sequences and developed corresponding markers through the traditional method of constructing a cDNA library, selecting clones and sequencing by the Sanger method. However, the throughput of the dataset was not sufficient to satisfy the needs of molecular biology and breeding studies. Furthermore, the traditional method was costly, time-consuming and cumbersome. Fortunately, next-generation sequencing (NGS) technologies, including the Roche (454) GS FLX, Illumina genome analyzer and Applied Biosystems SOLiD sequencer, are revolutionizing our ability to obtain massive transcriptome data [16]. Among these high-throughput sequencing technologies, the Illumina platform is the most widely used and is suitable for the discovery of variants and genes because it can produce large volumes of high-quality reads [17]. Moreover, with the support of paired-end sequencing and advanced assemblies, Illumina can be successfully applied to non-model organisms rather than only for model organisms [18][19][20][21].
'Shatian' pummelo is a representative pummelo cultivar with the characteristic of selfincompatibility [22]. In this study, we characterized the transcriptome information of 'Shatian' pummelo using the Illumina platform and developed SSR and SNP markers of pummelo. To our knowledge, this is the first report of a pummelo transcriptome used to construct a gene expression dataset with 12 G data. Such data provide an important resource for novel gene discovery and marker development and lay a solid foundation for future breeding efforts and molecular biology research on pummelo. In addition, we designed corresponding SSR and SNP primer pairs to test the accuracy of the in silico loci and obtained the diversity analysis of 44 citrus and relative accessions through SSR markers.

Illumina Sequencing and de novo 'Shatian' pummelo transcriptome assembly
To provide a comprehensive transcriptome platform for pummelo, we constructed a cDNA library of 'Shatian' pummelo through a mixture of RNA from seven sampled tissues (Fig. 1), i.e., petal, anther, filament, style, ovary, pedicel and leaf, and this library was named 'Cg' in this work and was sequenced using Illumina paired-end technology. The sequencing feature yielded approximately 149 million raw reads. After filtering out ambiguous, low-quality reads and reads with adaptors, the remaining 135,191,154 reads, encompassing 12,167,203,860 total nucleotides, were used for assembly. Trinity, a program for RNA-Seq transcriptome assembly without a reference genome [23], was used for our assembly, resulting in 101,235 contigs that contained 44 Mb sequences with an average length of 440 bp (Table 1). Of these contigs, 68.60% were shorter than 300 bp, 19.39% ranged from 300 to 1000 bp, and the remaining 12.01% were longer than 1000 bp (S1A Fig.).
Then, the contigs were further clustered and constructed into de Bruijn graphs. Each de Bruijn graph was processed independently to extract full-length splicing isoforms, namely, unigenes. Following these steps, we obtained 57,212 unigenes, of which distinct clusters and distinct singletons composed 37.5% and 62.5%, respectively. The mean size of the unigenes was 1010 bp, and 50% of the unigenes (N50) were 1630bp or longer ( Table 1). The length distribution of the unigenes indicates that the assembled unigenes with lengths varying from 200 to 300 bp, 300 to 1000 bp and above 1000 bp accounted for 22.41%, 40.69% and 36.90% of the total, respectively (S1B Fig.).

Annotation of the unigenes
To annotate 57,212 unigenes, a sequence similarity search based on the BLASTx algorithm was conducted against four public databases (i.e., the NCBI non-redundant (Nr) database, Swiss-Prot protein database, Clusters of Orthologous Groups (COG) database, and Kyoto Encyclopedia of Genes and Genomes (KEGG) database) with an E-value threshold of 10 -5 . In total, 39,584 unigenes were annotated to at least one of the mentioned databases, and 11,987 were matched with all of the databases (S2A Fig.). EST Scan was used to determine the sequence   (26,100) were annotated to the Swiss-Prot database. Among these annotated unigenes, 60.06% of Nr mapped, and 50.33% of Swiss-Prot hits had very strong homology, with an E-value1.0E-5 ( Fig. 2A and 2B). The top hits with a similarity greater than 80% against the Nr and Swiss-Prot databases accounted for 28.84% and 13.31%, respectively ( Fig. 2C and 2D). From the Nr results, we found that 23.91% of the unigenes were closely related to Vitis vinifera, followed by Ricinus communis (21.33%), Populus balsamifera subsp. trichocarpa (17.52%) and Amygdalus persica (14.26%) (S2B Fig.).
To further classify the functions of 'Shatian' pummelo unigenes, 39,488 unigenes with Nr annotation were submitted to the GO database consisting of molecular function, cellular component and biological process. A total of 31,501 unigenes were classified into 55 sub-categories with 255,243 functional terms (Fig. 4). Nearly half of the assignments were classified into biological process, and 36.48% and 14.38% fell into cellular component and molecular function, respectively. Under the biological process category, cellular process (19,618; 15.64%), metabolic process (19,292; 15.38%) and single-organism process (13,850; 11.04%) represented the major classifications. Furthermore, 2715 unigenes were assigned to the signaling category. Within the cellular component category, 23,617 unigenes were concentrated in the groups of the cell and the same number of unigenes in the groups of the cell parts. Interestingly, 19,117 unigenes were involved in the group of components of organelles. In the molecular function category, catalytic activity (15,976; 43.52%) and binding (15,027; 40.94%) were the most abundant subcategories, whereas the other groups accounted for a percentage of less than 10% (Fig. 4). To better understand the specific network of these transcripts, we annotated the 57,212 unigenes using the KEGG [24] database. As the results showed, 23,219 unigenes were mapped into 5 main categories and 128 KEGG pathways. As expected, most of the unigenes (19,780; 85.19%) fell into the metabolism category. In this category, global map, carbohydrate metabolism, lipid metabolism, and biosynthesis of other secondary metabolites were the most represented subcategories.
Genetic information processing, involving translation, folding, sorting and degradation, replication and repair, and transcription, contained 25.88% of the assembled unigenes. In addition, 8.95%, 7.63% and 4.94% of the unigenes were aligned to organismal systems, environmental information processing and cellular processes, respectively (S1 Table).
In short, the results of the functional analysis revealed the feasibility of obtaining transcriptome sequences through high-throughput technology, even for non-model organisms with complex genomes. The transcriptome is an important resource for molecular marker development, and further experiments on marker mining are necessary to illustrate the value and to boost the availability of the sequence dataset.
SSR mining and primer design SSR markers are characterized by co-dominance and are highly polymorphic, and they are therefore useful in the field of genetic mapping and diversity analysis. Using the MISA tool, we identified 10,276 potential SSRs distributed in 8405 sequences, of which 1,477 unigenes contained multiple SSR loci. The distribution of all of the SSR loci is shown in Table 2. Tri-nucleotide repeats accounted for 30.41%, closely followed by dinucleotide motifs (26.06%), whereas tetra-, penta-, and hexa-nucleotide repeats represented only approximately 2%. Altogether, 192 repeat types of nucleotide motifs were detected. The AG/CT (1674, 16.29%) type was the most common, excluding repeat types of mononucleotide motifs. To provide convenient access for 'Shatian' pummelo molecular markers, we designed five primers for each potential SSR locus and subsequently eliminated low-quality primers. A total of 1174 SSR loci with high-quality primers were obtained (S2 Table). The distribution of these 1174 SSRs on the nine chromosomes of C.sinensis was analyzed using local BLASTn with a cut-off of 1.0E-5. The results indicate that 88.16% (1035) of the SSR loci were localized, presenting a relatively uniform distribution on nine chromosomes. Chromosome 2, containing 165 loci, represented the highest density (5.35 SSR/Mb), whereas chromosome 8 was the lowest with 72 loci (3.17 SSR/Mb) (Table 3). Furthermore, based on the exact positions of the localized genes on the chromosomes, we drew the chromosome map of the 1035 SSR loci using the Mapchart software (Fig. 5).

SSR marker polymorphism
A total of 100 primer pairs were randomly selected and synthesized for wet-lab validation, and 87 primers successfully produced clear banding patterns upon polyacrylamide gel electrophoresis, of which 82 were of the expected sizes, and the remainder was larger than the expected sizes. Unsurprisingly, the observed sizes of some of the primers were larger than expected because the PCR template was genomic DNA, whereas the primers were designed according to cDNA.
To evaluate the utility of the SSR markers for phylogenetic analysis, 29 SSR primers selected from 82 primer pairs with desirable sizes were used to evaluate the genetic relationship among 44 citrus and relative accessions. The average amplified allele number of 29 SSR primers was 5.4, with a maximum of 10 and a minimum of 2 ( Table 4). The polymorphic information content (PIC) values ranged from 0.35 to 0.79 with a mean value of 0.52 (Table 4). The observed We then calculated the similarity matrix of 158 alleles of 29 polymorphic SSR markers according to Dice's coefficient and clustered the genotypes based on the UPGMA method. As shown in Fig. 6, all of the studied genotypes were clearly different and were classified into six distinct genetic groups with a Dice's coefficient of approximately 0.23. Then, 21 members of C. grandis were grouped into a single distance cluster. C. paradise, C. sinensis, C. aurantium and C. ichangensis were grouped together with C. reticulata, whereas C. aurantifolia, C.limon, C. medica and C. jambhiri were placed in the same cluster. F. hindsii and C. hongheensis were clustered together. P. trifoliata had a considerably distant relationship to any Citrus species, and Severinia buxifolia, as an out group, was placed in a single distinct cluster.

SNP mining and validation
SNP molecular markers are powerful tools for gene mapping and molecular marker-assisted selection because they have the most abundant variations in genomes. In this study, 64,720 putative SNP loci were identified within 19,172 unigenes, with 62.01% unigenes including two or more SNPs. Of the 64,720 SNPs, 61.24% belonged to transition type (C/T and A/G), and 38.75% were transversion type (G/T, C/G, A/T and A/C) (Fig. 7). There were almost as many  (19,830) in the transition types, whereas within the transversion types, A/T type (6,626) was the most common variation, followed by types A/C (6,400), G/T (6,301), and C/G (5,753). The genomic DNA of 'Shatian' pummelo was used as a template to validate the accuracy of the putative SNPs. A total of 20 SNP primer pairs were designed, and 14 primers presumptively including 53 potential SNP loci yielded target amplifications (S4 Table; S5 Table). The results of sequencing these 14 primers by the Sanger method indicated that 48 SNPs (90.57%) whose quality scores varied from 23 to 99 were validated (S5 Table).

Discussion
Illumina paired-end sequencing and assembly NGS technology, which is characterized by unprecedented high throughput, convenience and cost-effectiveness, is likely to become the method of choice for transcriptome sequencing, which is an important molecular biology tool that is widely used in various studies, such as in generating different dynamic views of gene expression [25][26][27], the discovery of novel genes or transcripts [28,29], splicing activity [30], and molecular marker mining [31][32][33][34]. The Illumina platform, as a commercially available NGS technology, was initially used for organisms with reference genomes due to the limitation of short read lengths and the available assembly software [35,36]. In recent years, with the introduction of advanced algorithms, massive short reads have been successfully assembled de novo even without reference sequences [23,37,38]. In addition, the latest paired-end sequencing technology has further increased the sequencing depth and expanded the assembly length [39].
In this study, a transcriptome dataset of the pooled RNA from seven 'Shatian' pummelo tissues was generated using Illumina paired-end sequencing. More than 135 million 90-bp paired-end reads were produced, encompassing 12 Gb of sequence data. The assembly results show that a feasible strategy could be used to assemble a large number of Illumina short reads of non-model woody plants. All of the short reads were assembled into 57,212 unigenes with a mean length of 1010 bp, which was longer than in many other studies of woody plants [40][41][42][43]. The N50 of the unigenes was 1630 bp in our study, whereas Du F et al. [44] obtained 49,991 unigenes with an N50 of 988bp using a similar method. Previous publications [45] indicated that an accurate and effective assembly tends to have a longer mean length and a larger N50 value.

Annotation and Functional Classification of Unigenes
The lack of a reference genome limits the prediction of protein function and utilization of the transcriptome dataset. To overcome this problem, we submitted all of the unigenes to BLASTx analysis against four public databases (Nr, Swiss-Prot, COG and KEGG). As a result, 69.19% of the unigenes had homologous hits with the above protein databases, which was similar to the value for lily (72.20%) [44] but less than for Chinese bayberry (79.55%) [46] and radish (92.09%) [47]. These unigenes that were not annotated to any known proteins may be novel transcripts that are unique to pummelo; however, we cannot exclude the possibility of sequencing error. In our research, 39,488 (69.02%) and 26,100 (45.62%) unigenes had homologous hits with the Nr and Swiss-Prot databases, respectively. Consistent with previous studies [21,40,44], the majority of unigenes matched with the Swiss-Prot database, displaying a reliable annotation in the Nr database. Among all of the pummelo unigenes, CL2175.Contig1_Cg, Unige-ne24287_Cg, and Unigene16912_Cg were the top 3 most abundant transcripts and were uniquely mapped with approximately1.76, 0.98, and 0.45 million reads, respectively. Based on the BLASTx analysis, CL2175.Contig1_Cg and Unigene16912_Cg were separately aligned with certain proteins of C.sinensis, indicating that an accurate transcriptome dataset was obtained due to the confirmed genetic relationship between C.grandis and C.sinensis [48]. The Unigene24287_Cg, without any mapping to any protein database, was assumed to be a highly expressed transcript that is unique to pummelo.
Additionally, 15,317 unigenes were assigned to 25 COG clusters with 29,134 functional terms, and 31,501 unigenes were classified into 55 GO sub-categories with 255,243 functional terms. A total of 23,219 unigenes were mapped to 128 KEGG pathways, in which metabolism represented the richest pathway. Chao Liang et al. [49] identified the key genes involved in fatty acid and unsaturated fatty acid biosynthesis using de novo assembly for the transcriptome dataset. Based on the results of the annotation, we also found that some genes were annotated, such as S-like RNase, CL6572.Contig1_Cg, Unigene4758_Cg, CL89.Contig1_Cg, and Unige-ne9342_Cg, that share a similar sequence with S-RNase, regulating gametophytic self-incompatibility [50,51]. Altogether, the information of transcriptome sequences is valuable for the study of molecular mechanisms; meanwhile, the results demonstrate the effectiveness and reliability of Illumina paired-end sequencing.  [15,52]. In our study, 10,276 potential SSRs and 64,720 candidate SNPs were identified from the sequences that were obtained by the NGS technology. Those SSRs were detected from 8405 sequences with a frequency of one SSR per 5.6 kb. This frequency was similar to the results for other citrus as reported by Chen et al. (1/5.2 kb) [53] and for wheat as observed by Peng et al. (1/5.46 kb) [54]. All of the SNP loci were distributed in 19,172 unigenes with an average of one SNP per 892 bp. The density was similar to salmon louse (1/768 bp) [55] and higher than for wheat (1/3.6 kb) [7] and red clover (1/1.5 kb) [56]. To check the redundancy of the 8405 sequences with SSR loci, a BLASTn alignment was conducted against the 212 pummelo EST sequences that were used for SSR mining by Biswas et al. and Chai et al. [15,52]. Only 2.82% of the sequences exhibited homology, indicating that our SSR set made a great contribution to SSR identification for C.grandis.

Markers identification and validation
The primers of SNP and SSR were designed to test the validity of the in silico loci. Because SSRs have been extensively and effectively used in genetic analysis, we developed corresponding primers for 1174 SSR loci to provide easy access for researchers; additionally, 88.16% of the loci were localized uniformly to nine chromosomes of C.sinensis. Then, 100 SSR primers were randomly selected, of which 87 yielded successful bands, and 82 were of the expected size. The success rates of amplification were different among the various plants: 100% for peanut [57], 90% for sweet potato [21], 82% for alfalfa [58], and 72% for apricot [59]. Furthermore, of the 20 SNP primers, 14 successfully amplified target fragments, and more than 85% containing low-and high-quality loci were confirmed via Sanger sequencing, which was slightly higher than the 76% accuracy reported by Yu et al. [60]. In total, the results indicated that NGS technologies can be used to develop abundant SNP or SSR markers with high efficiency and accuracy.
SSR marker genetic diversity analysis SSR markers were subjected to genetic diversity analysis, and markers with at least 3 alleles were identified, which was better than the previous result by Chai et al. [52]. The PIC value in our study (0.52) was greater than 0.5, suggesting that the majority of SSR markers were at a high polymorphism level and were suitable for genetic diversity study. However, the PIC was lower than in Chai's report (0.87) [52]. The possible reason for this result is that fewer markers were used, and the primers were selected randomly rather than intentionally selected for high polymorphism.
According to the hypothesis of Barrett and Rhodes [61], there are only three major ancestral citrus species, citron (C.medica L.), mandarin (C.reticulata Blanco) and pummelo (C.grandis Osb.). Other cultivated species were derived from hybridization between these major ancestral species and closely related genera [62]. Our findings support this hypothesis. From the dendrogram, we found that pummelo, mandarin, and citron were clearly distinguished from each other and were allied with three independent clusters. All of the pummelos were grouped together and formed a single cluster, excluding 'huyou' (C.paradisi Macf.). In our results, 'huyou' exhibited a close relationship with sweet orange and sour orange, confirming the hypothesis that 'huyou' is a hybrid between pummelo and sweet orange. Sweet orange, sour orange and mandarin all belong to the same cluster; therefore, our findings indicate that, as many researchers have assumed [48], mandarin has a significant genome contribution to these two types of orange. Lime, lemon, and limonia had close relationships with citron, indicating that citron might be a parent species [63][64][65]. Moreover, trifoliate orange, kumquat, and papeda were separated from citrus, although there were some similarities in the morphological characteristics and metabolic composition between citrus and trifoliate orange, kumquat, and papeda [66][67][68]. Taken together, these SSR markers that were developed based on the transcriptome dataset could elucidate a relatively unambiguous genetic relationship among Citrus and its relatives.

Conclusion
This work is the first transcriptome study on the non-model plant C.grandis using the NGS technology, which provides a comprehensive available transcriptome dataset expressed in the floral organs and young leaves. In total, 57,212 unigenes were assembled, of which 39,584 could be annotated to known proteins of the public databases. Based on the dataset, we identified the top three most abundant transcripts and some S-like RNase genes that were related to the self-incompatibility mechanism. Moreover, a large number of SNPs and SSRs were identified, and high-quality primers of 1174 SSR loci were designed, 88.16% of which were localized to the chromosomes of sweet orange. Ultimately, a relatively unambiguous genetic relationship among citrus and its relatives was elucidated via 29 SSR markers that were developed from the transcriptome dataset. The enrichment results not only indicate that the assembled data were reliable and accurate and that massive short reads could be effectively assembled without reference sequences but also demonstrate that the transcriptome dataset could provide abundant genetic resources for novel gene identification and marker development.

Methods
Plant material and RNA and DNA extraction 'Shatian' pummelo was cultivated in the Guangxi Citrus Research Institute in Guilin City, Guangxi, China. The sampled tissues included young leaves and six floral organs (Fig. 1), i.e., petal, anther, filament, style, ovary, and pedicel, collected separately 10 days before the anthesis stage of the flowers. To examine the polymorphism of the detected SSR markers in our study, fresh young leaves of 41 citrus accessions and 3 related species (Table 5) were collected. All of the tissue and leaf samples were immediately frozen in liquid nitrogen and stored at −80°C until further extraction. The RNA of each tissue was isolated in terms of the protocol of Liu et al. [69], and the RNAs of the seven different tissues were equally pooled. The DNA was extracted from young leaves using the method described by Cheng et al. [70].

Library construction and sequencing
The transcriptome sequencing of the constructed library was performed by the Beijing Genomics Institute (BGI, Shenzhen, China) according to the manufacturer's instructions. Initially, RNA was treated with DNase I. Poly (A) mRNA was isolated from the treated RNA using magnetic oligo (dT) beads, and enriched mRNA was then broken into short fragments by mixing with the fragmentation buffer. Subsequently, first-strand and second-strand cDNA was successively synthesized using these fragments as templates. The double-stranded cDNA was resolved for end-repair, and a single adenine was added to the 3' end. To distinguish between different samples, sequencing adapters were ligated to the short fragments, and suitable fragments of approximately 200 bp were selected for PCR amplification. An Agilent 2100 Bioanalyzer and an ABI StepOnePlus Real-Time PCR System were used to validate the quantification and qualification of the sample library. Finally, the cDNA library was sequenced using Illumina HiSeq 2000. The sequence data were deposited in the NCBI Sequence Read Archive (http://www. ncbi.nlm.nih.gov/Traces/sra) under accession number SRP051793.

Data filtering and de novo assembly
To separate high-quality sequences from raw reads for further analysis, we conducted a filtering process that discarded reads containing adaptors or unknown nucleotides that were larger than 5% and low-quality reads with more than 20% of the bases of quality value 10.The short read assembly program Trinity [23] was used for de novo assembly of the clear reads through the following steps. First, clean reads with a certain length of overlap were extended to form longer contigs, and these contigs were clustered and constructed into individual de Bruijn graphs. Then, the reads were mapped back to the contigs for quality validation. Based on the outcome of the validation, the contigs with little mapping were removed, and the high-quality contigs were further assembled into unigenes. Finally, the obtained unigenes were divided into two classes, namely clusters and singletons, according to the gene family clustering. The clusters were named with 'CL' as the prefix, followed by an ID number, whereas singletons were prefixed with 'unigene'. Additionally, the similarity among unigenes from the same cluster was greater than 70%.

Function annotation and classification
The homology searches between the assembled unigenes and the NCBI non-redundant (Nr) database, Swiss-Prot protein database, Kyoto Encyclopedia of Genes and Genomes (KEGG) database and Clusters of Orthologous Groups (COG) database were conducted using the BLASTx algorithm with an E-value of less than 10 -5 , and the best alignments were used to identify the sequence direction. A priority order of Nr, Swiss-Prot, KEGG and COG was followed when the results of different databases conflicted. If the unigenes were unaligned to any of the four public databases, then the software ESTScan [71] was introduced to determine the sequence direction. Based on the Nr results, the GO annotation of all of the unigenes was performed via the Blast2GO [72] program. Subsequently, WEGO [73] software was used to produce a GO functional classification at the macro level. To annotate and classify the possible functions, the unigenes were also aligned to the COG database. Pathway assignments were performed according to the KEGG pathway database.

Marker development and Primer design
A total of 57,212 de novo sequences were examined to mine for SNP and SSR loci. SSR loci were detected using MicroSAtellite (MISA; http://pgrc.ipk-gatersleben.de/misa/misa.html). In our study, mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides were searched with a minimum of 12, 6, 5, 5, 4, and 4 repeat units, respectively. We first designed 5 primer pairs for each locus using the software Primer 3 and then filtered low-quality primers (S2 Table). Local BLASTn was used to check the redundancy and analyze the chromosome localization with a cut-off of 1.0E-5. The distribution results of the localization on nine chromosomes of C.sinensis were depicted using the Mapchart 2.2 software.

Marker amplification and validation
To validate the putative markers, we randomly selected and synthesized 100 SSR primer pairs (S3 Table) and 20 SNP primer pairs (S4 Table). The genomic DNA of 'Shatian' pummelo was used as the template for marker validation. The SSR amplification reactions were conducted according to the protocol by Chai et al. [52]. The PCR products were screened using denaturing 6% polyacrylamide gel electrophoresis and stained by ethidium bromide following the protocol of Ruiz et al. [75]. For potential SNP validation, a 20-μl PCR volume was used, containing 1.2 μl of genomic DNA, 0.5 μM each primer, 1×reaction buffer, 200μM each dNTP and 2.5 U FastPfu DNA Polymerase. PCR amplification was performed in a MJ-PTC-200 tm thermal controller (MJ Research, Waltham Mass) using the following steps: 94°C for 10 min, 31 cycles of 94°C for 30 s, 55°C for 30 s, 72°C for 30 s, and a final step at 72°C for 10 min. The amplified products were subjected to 2% agarose gels, and then the products, which were detected using targeted fragments, were sequenced bidirectionally (forward and reverse) in a 10-μl reaction by the Beijing Genomics Institute (BGI, Shenzhen, China).

SSR polymorphism analysis
The polymorphism on 41 citrus accessions and 3 related species was tested using 29 SSR primers (Table 4) selected from the 82 primers with expected sizes. The PCR amplifications and products detection were performed as previously described. A 0/1 matrixcoding the absence/ presence of each band was constructed for further data analysis. The polymorphic information content (PIC), number of alleles, observed heterozygosity (Ho) and expected heterozygosity (He) were calculated through PowerMarker version 3.25 [76]. A genetic similarity matrix was analyzed using the NTSYSpc 2.1 software [77], and a dendrogram was constructed using the UPGMA method.