Genome Survey Sequencing for the Characterization of the Genetic Background of Rosa roxburghii Tratt and Leaf Ascorbate Metabolism Genes

Rosa roxburghii Tratt is an important commercial horticultural crop in China that is recognized for its nutritional and medicinal values. In spite of the economic significance, genomic information on this rose species is currently unavailable. In the present research, a genome survey of R. roxburghii was carried out using next-generation sequencing (NGS) technologies. Total 30.29 Gb sequence data was obtained by HiSeq 2500 sequencing and an estimated genome size of R. roxburghii was 480.97 Mb, in which the guanine plus cytosine (GC) content was calculated to be 38.63%. All of these reads were technically assembled and a total of 627,554 contigs with a N50 length of 1.484 kb and furthermore 335,902 scaffolds with a total length of 409.36 Mb were obtained. Transposable elements (TE) sequence of 90.84 Mb which comprised 29.20% of the genome, and 167,859 simple sequence repeats (SSRs) were identified from the scaffolds. Among these, the mono-(66.30%), di-(25.67%), and tri-(6.64%) nucleotide repeats contributed to nearly 99% of the SSRs, and sequence motifs AG/CT (28.81%) and GAA/TTC (14.76%) were the most abundant among the dinucleotide and trinucleotide repeat motifs, respectively. Genome analysis predicted a total of 22,721 genes which have an average length of 2311.52 bp, an average exon length of 228.15 bp, and average intron length of 401.18 bp. Eleven genes putatively involved in ascorbate metabolism were identified and its expression in R. roxburghii leaves was validated by quantitative real-time PCR (qRT-PCR). This is the first report of genome-wide characterization of this rose species.


Introduction
Presently, about 100-250 species have been described in the genus Rosa, many of which are recognized for their ornamental horticultural use [1]. The chromosome number of members of this genus are based on multiples of seven and range from 2n = 2x = 14 to 2n = 8x = 56 [2]. Rosa roxburghii Tratt (2n = 2x = 14), which is widely distributed in Southwest China, has Province, P. R. China (20136006-1, AHM), http://kjt. gzst.gov.cn; and the natural science foundation in Guizhou Province, P. R. China (20122169, LM), http:// www.nsfc.gov.cn. The funders provided support in the form of salaries for authors [AMH and LM], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the 'author contributions' section.
Competing Interests: The authors have declared that no competing interests exist.
Paired-end library with insert size of 220 base pairs (bp) was constructed from randomly fragmented genomic DNA, following the standard protocol (Illumina, Beijing, China). Sequence data was generated by Beijing Biomarker Technologies Co., Ltd. (Beijing, China), using an Illumina HiSeq 2500 sequencing platform. The read length was 126 bp, and clean reads were obtained after filtering and correction of the sequence data, and were relatively accurate for estimating the size of the genome, repetitive sequences, and heterozygosis. Then, based on Kmer analysis, information on peak depth and the number of 17-mers was obtained. Its relationship was expressed by using the following algorithm: Genome size = K-mer num/Peak depth [20].
Sequence assembly and guanine plus cytosine (GC) content analysis SOAPdenovo software [21] and Abyss were applied for genome assembly with the pre-processed reads, where k-mer sizes of 31,54,63,70,77, and 83 were examined using default parameters, and the optimal k-mer size was selected from the N50 length. The usable reads > 200 bases in length were selected to realign the contig sequences because the sequences < 200 bp were likely to be derived from repetitive or low-quality sequences. Then, the paired-end relationship between reads was coincident between contigs. The scaffolds were constructed step by step using insert size paired-ends. The 10-kb non-overlapping sliding windows along the assembled sequence were used to calculate GC average sequencing depth.

Repetitive sequences
Due to the relatively low conservatism of the repetitive sequence among species, a particular repetitive sequence database was built to predict repeat sequences. The software programs LTR_FINDER [22], MITE-Hunter [23], RepeatScout [24], and PILER-DF [25] were used to construct a de novo repeat library, classified by PASTEClassifier [26], and combined with the Repbase transposable element library [27] to act as the final library. Then, the software Repeat-Masker [28] was run to find homologous repeats in the final library. SSR motifs were identified using the SciRoKo software [29] in the 'MISA' mode, with default parameters. The minimum numbers of SSR repeats for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides adopted for identification were 14, 7, 5, 4, 4, and 4, respectively.

Gene prediction and annotation
For de novo prediction, after filtering scaffolds of < 1000 bp in size, Genscan [30] and Augustus were used to predict genes with parameters trained on R. roxburghii. Then, BLAST alignment was performed between predicted genes and common databases such as Nt, Nr, TrEMBL, Swiss-Prot, Pfam, 'euKaryotic clusters of Orthologous Groups' (KOG), Kyoto Encyclopedia of Genes and Genomes (KEGG), plant Gene ontology (GO), and Clusters of Orthologous Groups (COG). Meanwhile, the described genes were classified into the KOG slim categories, the GO categories, and then mapped onto the KEGG reference pathways as described by Hirakawa et al. [31]. For homology-based prediction, protein sequences for Malus×domestica, Pyrus bretschneideri, Fragaria vesca, Prunus mume, Prunus persica, and Vitis vinifera were downloaded from publicly available databases. The putative genes of R. roxburghii were clustered by using OrthoMCL [32] with the unigene sets of apple, pear, strawberry, Prunus mume, and peach. Single-copy protein sequences of R. roxburghii and the 6 other species were used to construct the evolutionary tree by using the software PHYML [33].

Genome sequencing and genome size estimation
After the sequence data was filtering and correction, a total of 30.29 Gb clean reads were generated from the small-insert (220 bp) library, with 95.14% Q20 bases (base quality > 20), about 62.99× coverage (Table 1), much greater than 30× coverage, which was required for successful assembly. All of the clean data were used for K-mer analysis. For the 17-mer frequency distribution (Fig 2), the number of K-mers was 26,445,309,972, and the peak of the depth distribution was at 54.98×. The estimated genome size was 480.97 Mb, which was calculated by using the following formula: Genome size = K-mer num/Peak depth. Similarly, a certain repeat rate could cause a repeat peak at the position of the integer multiples of the main peak,~106×, so the genome size of repetitive sequences was estimated to be 291.49Mb, which was about 60.60% of the R. roxburghii genome. In addition, the heterozygosis rate could cause a sub-peak at a position half of the height of the main peak,~26×, which indicates about 0.18% of the heterozygosis rate in this genome.

Sequence assembly and GC content analysis
All of the clean reads and the software SOAPdenovo and Abyss were used to carry out de novo assembly. Assembly with k-mer 77 by SOAPdenovo was selected, as it has the optimal reading for N50 (S2 Table), which is defined as a weighted median and is the smallest contig/scaffold size in the set whose combined length totals 50% of the genome assembly, to produce a contig K-mer (k = 17) analysis for estimating the genome size of R. roxburghii. The x-axis is depth (X); the y-axis is the proportion that represents the frequency at that depth divided by the total frequency of all depths. The genome size was estimated by using the formula: Genome size = K-mer num/Peak depth, and the heterozygosis rate causes a sub-peak at a position half of that of the main peak, whereas a certain repeat rate can cause a similar peak at the position of multiple integers of the main peak. with the N50 of~1.48 kb, and a total length of~405.81 Mb (Table 2). A sequence was also generated, with the scaffold N50 length of~3.55kb and a total length of~409.36 Mb. The total gap length (Ns) was~3.55 Mb. The average GC content of R. roxburghii genome was~38.64% (Table 2), which was higher than that of ants (33.7-37.7%) [35,36] and potatoes (34.8-36.0%) [31,37], lower than that of human (41%) and Nasonia vitripennis (40.6%) [38], but similar to that of date palm (38.5%) [39] and Australian kangaroo (38.8%) [40]. Therefore, the R. roxburghii genome was of mid-GC content. A too high (>65%) and too low (<25%) GC content may cause sequence bias on the Illumina sequencing platform, thus seriously affecting genome assembly [41]. Moreover, the GC depth was slightly blocked into 2 layers (Fig 3), which was in part caused by a 0.18% heterozygosity rate. Maybe only one of the two sets of homologous chromosomes in the diploid was assembled, which resulted in the emergence of the lower layer [21].

Repetitive sequences
The total length of repetitive sequences was~147.89 Mb (Table 3), which was about 47.55% of the R. roxburghii genome, and lower than that of other plant species such as pear (51.3%) [18], Lotus japonicus (56.8%) [42], potato (64.2%) [37], apple (67%) [14], tomato (68.3%) [43]. In   addition, this length was also lower than the estimated number of K-mer (60.60%; Fig 2), which could be the limitations of the assembling effect that resulted in the loss of 21.54% of the repetitive sequences during assemble. 90.84 Mb transposable elements (TE) were obtained, comprised 29.20% of the genome (Table 3), in which retroelements and DNA transoson were identified. Retroelements, also called class I transposable element (Table 3), comprised 24.90% of the genome. And DNA transposons, also named class II transposable element (Table 3), comprised only 4.30% of the genome. Long terminal repeats (LTRs) were observed to be the most abundant repeat elements, comprised 16.74% of the genome, in which 6.90% was gypsy, 9.76% was copia and other LTRs occupied only 0.08% (Table 3). The ratio (0.71:1) of gypsy-like to copia-like elements was calculated. There were 1.15% SSRs and 16.92% uncharacterized repeats (Table 3).
A total of 167,859 SSRs were identified and among which mono-nucleotide repeats showed predominant type, which accounted for 66.30% of the observed SSRs, followed by the di-(25.67%), tri-(6.64%), tetra-(1.08%), penta-(0.16%), and hexa-(0.15%) nucleotide repeats (Table 4). Mono-nucleotide repeats have been reported to be the most common type of repeats whether in monocot species, such as rice, sorghum, and Brachypodium or in dicot species, for example, Arabidopsis, Medicago, and Populus, which accounted for 79% in Medicago at most [44]. The mono-, di-and tri-nucleotide repeats contributed to nearly 99% of SSRs in R. roxburghii, and a very small portion was contributed by tetra-, penta-and hexa-nucleotide repeats. Moreover, 363 motif types were identified in R. roxburghii genome, including 2 of mono-, 8 of di-, 30 of tri-, 80 of tetra-, 91 of penta-, and 152 of hexa-nucleotide repeats (S3 Table). Within the dinucleotide repeat motifs, the AG/CT was most abundant, which accounted for 28.81%, followed by GA/TC at 27.71% (Fig 4). And among the trinucleotide repeat motifs, the common motifs were GAA/TTC and ATT/AAT, accounting for 14.76% and 13.55%, respectively (Fig 5).

Gene prediction and annotation
Based on the genome of R. roxburghii, with a filtering scaffold of < 1,000 bp for de novo prediction, program Augustus got a predicted gene number of 20,589, and a total of 22,721 genes were predicted by Genescan (Table 5). We choose Genescan for further analyses. The identified genes have an average length of 2,311.52 bp, an average exon length of 228.15 bp, and intron length of 401.18 bp. The number of predicted genes in the genome of R. roxburghii was much  [17]. It has been reported that the insufficient sequence depth coverage, variable regulation of gene expression levels, and low sequence homology because of limited gene information from closely related species might be possible reasons [45]. Of the 22,721 predicted genes in the R. roxburghii genome, 17,637 genes matched known genes in common databases, of which, 11,622 had Swiss-Prot homologs, 16,173 had TrEMBL homologs, and 23.38% (5084) were unknown ( Table 6). A total of 7,040 genes were identified by GO slim analysis and further classified into the categories of molecular function, cellular component, and biological process (Fig 6). First of all, around 48.70% of the genes were grouped under biological processes, in which metabolic process was the most highly represented group. Secondly, 29.46% of the genes were grouped under cellular components, in which cell part and cell were the most significantly represented groups. Finally, 21.84% of the genes were grouped under molecular functions, in which catalytic activity represented a relatively high proportion.
Of the putative R. roxburghii genes, 12,419 were clustered with predicted genes that were identified in other species, whereas the remaining 738 were not clustered and therefore considered as R. roxburghii-specific genes (Fig 7), which was far more than that of Prunus persica (302), Malus×domestica (399), and Prunus mume (580), but much lower than that of Pyrus bretschneideri (1,221). The evolutionary relationships among species (S2 Fig) proved that there was a closer relationship between rosebush R. roxburghii and herbaceous strawberry.

Discussion
Flow cytometry has been regarded as a standard method for the prediction of the genome size of plants [46]. However, in the recent years, the development of the NGS technology has provided researchers an affordable means of addressing a wide range of questions relating to emerging and non-model species. In addition, the k-mer method has been successfully applied for the estimation of genome size using NGS reads without prior knowledge of the genome size. Such approach has been utilized in the analysis of the genomes of Gracilariopsis lemaneiformis [45], Cucumis sativus [47], and Myrica rubra [48]. The genome size estimated by K-mer depth distribution of sequenced reads is generally consistent with that of flow cytometry bretschneideri, and P. mume. The first number under the species name is the total number of putative genes subjected to clustering. The second number is the clustered family number. The overlapping areas represent sequences clustered with other species, and the number of non-overlapping areas represents specific genes.
Fruit trees are perennial, and the majority of these are highly heterozygous; therefore, the assembly of fruit tree genomes is relatively difficult using the WGS strategy. Homozygous materials for genome sequencing were always in priority [17,45,51], although a complicated bacterial artificial chromosome (BAC) approach could resolve problems associated with the assembly of a heterozygous genome [18]. Based on the feasibility of estimating heterozygosity from low-coverage genome sequence [52], a heterozygosity rate of 0.18% was observed in the R. roxburghii, which was higher than that of other sequenced plants such as pigeon pea (0.067%) [53], Prunus mume (0.08%) [16], but much lower than that of black cottonwood (0.26%) [54] and date palm (0.46%) [39], which could be utilized in genome studies using the WGS strategy.
Several investigations have determined a genome size range of 294-782 Mb for at least 33 rose species and several cultivars [55]. This observed variability in genome size is not likely due to differences in gene numbers but rather to variations in non-coding sequences such as the intron size [56], and a variety of other factors, including the copy number of TEs, the amount or size of SSRs, the size of inter-enhancer spacers, and the number of pseudogenes [57]. For example, the observed genome size difference between apple and pear is mainly due to repetitive sequences that are predominantly contributed by TEs, whereas the size of the genic regions is similar in both species [18]. In addition, different TE compositions, especially the composition of LTRs, resulting from TE multiplication, may cause genome size changes, which might have large effects on speciation [58,59]. In the present study, The ratio (0.71 to 1) of gypsy to copia LTRs in R. roxburghii was remarkably lower than that observed in peach (1.16 to 1) [17], strawberry (1.20 to 1) [15], pear (1.99 to 1) [19], and apple (4.58 to 1) [14]. These results could contribute to the understanding of Rosaceae genome evolution [60]. The TE content in R. roxburghii was 29.20%, which was similar to that of peach (29.60%) [17]. In addition, the amount of LTRs, which comprised 16.74% of the R. roxburghii genome, was similar to that of strawberry (~16%) [15]. However, the genome size of R. roxburghii was 2-fold larger than that of the two species. This difference in genome size might not be due to the amount of TEs or LTRs, but the composition of LTRs. Meanwhile, SSRs comprised 1.15% of the R. roxburghii genome, which was significantly larger than that observed in apple (0.27) [14], Pyrus bretschneideri (0.22) [18], and Pyrus communis (0.04) [19], and might have potentially led to the genome expansion of R. roxburghii.
Genomic SSR markers, reliable, highly polymorphic, often multi-allelic, and easy to amplify, are widely used in genetic diversity, genetic map construction and so on [53]. However, the lack of available genomic resources in R. roxburghii impeded the use of microsatellite markers. To date, the limited EST-SSR markers were developed for R. roxburghii [9], but no genomewide SSR markers have been published. Presently, the genome survey based on NGS is an especially useful method to explore SSR markers for tree crops [61].
Compared to fruits [9,34], R. roxburghii leaves undergo a higher level of active oxidation loss and recycling of AsA. The L-galactose pathway is considered as the dominant route for AsA biosynthesis in several plant species [62], and GGP may play a key role in the L-galactose pathway in R. roxburghii fruits [9]. In the present study, GGP was not highly expressed in AsAabundant aged leaves, although the level of GLDH expression was similar to the variable pattern of T-AsA content. Besides, the discovery of GUR and MIOX genes in the present study suggests that R. roxburghii can use GalUA or myo-inositol as an initial substrate in AsA biosynthesis, implying that multiple pathways were involved in AsA metabolism in R. roxburghii leaves. In addition, GUR via the GalUA pathway played important roles in AsA biosynthesis in strawberry fruits [63]. APX, which encodes a well-recognized enzyme, catalyzes the oxidation of AsA with high specificity, and AAO, which encodes another vital redox enzyme, also catalyzes the oxidation of apoplast AsA in the presence of oxygen. These two upregulated genes might have caused DHA accumulation in mature leaves (Fig 8).
This is the first report of genome-wide characterization in the genus Rosa. Among the 100-250 species in this genus, R. roxburghii is most important in terms of its horticultural, nutritional, and medicinal value. However, its limited genomic information has constrained genetic studies of R. roxburghii. A total of 167,859 SSRs and 22,721 genes derived from the R. roxburghii genome survey could help in the construction of high-density linkage maps and in conducting gene-based association studies. In addition, the generated dataset could contribute to the understanding of Rosaceae genome evolution. Evaluation of the expression of candidate genes involved in AsA metabolism may improve our understanding of the molecular mechanisms underlying ascorbate accumulation in R. roxburghii.

Author Contributions
Conceived and designed the experiments: HMA. Performed the experiments: ML LLL. Analyzed the data: ML HMA. Contributed reagents/materials/analysis tools: ML LLL. Wrote the paper: ML HMA.