Transcriptomic Analysis of a Tertiary Relict Plant, Extreme Xerophyte Reaumuria soongorica to Identify Genes Related to Drought Adaptation

Background Reaumuria soongorica is an extreme xerophyte shrub widely distributed in the desert regions including sand dune, Gobi and marginal loess of central Asia which plays a crucial role to sustain and restore fragile desert ecosystems. However, due to the lacking of the genomic sequences, studies on R. soongorica had mainly limited in physiological responses to drought stress. Here, a deep transcriptomic sequencing of R. soongorica will facilitate molecular functional studies and pave the path to understand drought adaptation for a desert plant. Methodology/Principal Findings A total of 53,193,660 clean paired-end reads was generated from the Illumina HiSeq™ 2000 platform. By assembly with Trinity, we got 173,700 contigs and 77,647 unigenes with mean length of 677 bp and N50 of 1109 bp. Over 55% (43,054) unigenes were successfully annotated based on sequence similarity against public databases as well as Rfam and Pfam database. Local BLAST and Kyoto Encyclopedia of Genes and Genomes (KEGG) maps were used to further exhausting seek for candidate genes related to drought adaptation and a set of 123 putative candidate genes were identified. Moreover, all the C4 photosynthesis genes existed and were active in R. soongorica, which has been regarded as a typical C3 plant. Conclusion/Significance The assembled unigenes in present work provide abundant genomic information for the functional assignments in an extreme xerophyte R. soongorica, and will help us exploit the genetic basis of how desert plants adapt to drought environment in the near future.


Introduction
Understanding the genetic basis of how organisms adapt to climate change is one of the most challenging tasks [1,2]. Although many attempts have exploited the adaptation mechanism in the species with known genome sequences, the molecular basis of adaptation in the non-genomic species is still poorly understood [1,2], especially in the plants from arid regions which contain plenty of potential genetic resources for ecology engineering. As change and human activity [14,16]. R. soongorica has undergone desertification of Asia which initiated at least 22 million years ago according to the palaeomagnetic measurements and fossil evidence [17]. During the process of adaptation to desertification, R. soongorica has evolved specific traits including extremely thick cuticle, hollow stomata, specialized leaf shape, deep root system, and effective physiological mechanisms such as reduced transpiration rate, increased water use efficiency, and maintaining the stem vigor to survive desiccation by leaf abscission [7,[18][19][20][21]. Much effort has been made in R. soongorica to elucidate the mechanism of drought adaptation during last decade, however, due to paucity of genomic information, most of the previous studies have limited to its physiological characteristics [21][22][23][24][25][26][27]. Little work had focused on the genetic diversity based on neutral markers (RAPD [28], ISSR [29,30] and cpDNA [31]). However, all these studies failed to dissolve the adaptive evolution of R. soongorica. Therefore, transcriptome sequences are in urgent to supply sufficient functional genomic information to address systemically the genetic mechanisms of drought adaptation of R. soongorica.
In this study, the R. soongorica transcriptome was sequenced by the Illumina paired-end sequencing technology (Illumina Hi-Seq TM 2000 platform). A total of 4.8 gigabases raw data was assembled into 173,700 contigs and further constructed into 77,647 unigenes (mean length = 677 bp, N50 = 1109 bp). Moreover, 123 unigenes were identified to be potentially involved in drought adaptation. To our surprise, all the C 4 photosynthesis genes were existed and active in R. soongorica which has been regarded as a typical C 3 plant [32]. The R. soongorica transcriptomic information provides a prime reference point for the subsequent exploitation of this important genetic resource and will facilitate to unravel the mechanism of adaptation to extreme arid environment.

Sequencing and de novo Assembly
To obtain a global overview of the R. soongorica transcriptome, a pooled cDNA library representing the inflorescences, leaves, and seedlings was constructed, and then sequenced on the Illumina HiSeq TM 2000 platform. A total of 4.8 gigabases dataset was generated from 53,193,660 clean paired-end reads with length of 90 bps and Q20 over 96% (Table 1). This suggested that the sequencing output and quality were good enough for further analysis.

Sequence Annotation
Several complementary approaches were utilized to validate and annotate the assembled unigenes. The unigenes were first aligned to the National Center for Biotechnology Information (NCBI) non-redundant protein (Nr) database, non-redundant nucleotide sequence (Nt) database, and the Swiss-Prot protein database with E-values less than 1e-5. Among the 77,647 unigenes, 41,582 (53.55%), 28,197 (36.31%) and 25,297 (32.58%) unigenes were significantly matched to the known genes in the Nr, Nt and Swiss-Prot protein databases, respectively (Table  S1 and Table 2). The E-value distribution of the top hits in the Nr database showed that 46.01% of the sequences were mapped to the known genes in plants with best hits (E-value,1e-50, mean identity is 69.54%), and approximately 16% of unigenes can hit deposited sequences with similarity over 80% (Figure 2A and 2B). About 82% of annotated unigenes can be assigned with a best score to the sequences from the top seven species, i.e., Vitis vinifera (41.65%), Ricinus communis (13.46%), Populus trichocarpa (11.39%), Glycine max (8.12%), Medicago truncatula (3.29%), Arabidopsis thaliana (2.08%) and A. lyrata subsp. lyrata (1.78%) ( Figure 2C). Interestingly, the phylogenetic relationship based on the internal transcribed spacer (ITS) also showed R. soongorica in among the other rosids species firstly diverged from V. vinifera, though the bootstrap value of their relationship is below 50% ( Figure 2D).
GO and COG classification. To identify functional categories among the 77,647 unigenes, all the best BLAST hits were input into the Gene Ontology (GO) Software Blast2GO for GO functional enrichment analysis by performing Fisher's exact test [34,35]. In total, 38.21% of unigenes (29,666) could be assigned to gene ontology classes with 202,607 functional terms ( Figure 3, Table S2). Interestingly, cellular process (p = 0.007), metabolic process (p = 0.007) and response to stimulus (p = 0.009) are strong significantly overrepresented in the 29 biological process GO groups.
For further functional prediction and classification, all of the 77,647 unigenes were aligned to the Clusters of Orthologous Groups of proteins (COG) category. Overall, 19.30% of unigenes (14,987) were assigned into 25 COG categories with 28,537 COG functional terms ( Figure 4 and Table S3). The categories including transcription (2,455,8.60%), carbohydrate transport and metabolism (1,795, 6.29%), signal transduction mechanisms (1,788, 6.27%) and secondary metabolites biosynthesis, transport and catabolism (860, 3.01%) were identified, which might be related to response for drought stress in plants.
KEGG pathway mapping. To further gain insights into the biological functions and interactions of our unigenes, a pathwaybased analysis was performed in the light of the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway database which based on the roles in biochemical pathways. There were 23,569 (30.35%) out of the 77,647 unigenes were mapped to 128 KEGG pathways. Among them, 5,056 unigenes (21.45%) were related to metabolic pathways (Ko01100, no maps in KEGG database), 2,460 (10.87%) to biosynthesis of secondary metabolites (Ko01100, no maps in KEGG database). The highest representation with KEGG map was Plant-pathogen interaction (Ko04626, 1,156 unigenes, 4.90%), followed by Plant hormone signal transduction (Ko04075, 1,113 unigenes, 4.72%) and RNA transport (Ko03013, 1,002 unigenes, 4.25%) (Table S4). To be mentioned, 57 core enzymes was detected in the biosynthetic pathway of flavonoid (Ko00941, 228 unigenes) which involved in secondary metabolism under abiotic stresses in plants [36] ( Figure  S2). Also, all of the core components were found in the circadian rhythm pathway (Ko04712, 192 unigenes), which is crucial for timing of flower and budset in response to the seasonal rhythm of temperature and light length [37,38] (File S1). R. soongorica has been regarded as a C 3 plant based on the photosynthesis characteristics [39], but all the core genes of C 4 carbon fixation were found in our transcriptome, surprisingly (Ko00710, 155 unigenes, File S2). Absisic acid (ABA) is a crucial hormone involved in many stress responses [40]. The key enzymes in its biosynthetic and catabolic pathways (Ko00906) and receptor genes (Ko04075) were discovered as well (Table S5 and File S3).
Rfam and Pfam analysis. From Nr, Nt, Swiss-Prot, GO, COG, and KEGG databases, more than half of the unigenes (42,839, 55.17%, mean length was 975 bp, Table 2 and Figure 5A) were annotated, while 34,808 (44.83%) unigenes (mean length was 334 bp, Figure 5A) failed to be annotated. As shown in Figure 5B, the annotated transcripts are significant more abundant in the pool (ANOVA p value = 2.2e-16).

Functional Genes Related to Drought Adaptation
To screen functional genes related to drought adaptation through three main strategies (drought escape, drought avoidance and drought tolerance; reviewed in [41,42]), candidate genes from Arabidopsis were local BLASTed against 77,647 unigenes in R. soongorica. A total of 123 unigenes homologous to 113 Arabidopsis candidate genes potentially involved in drought adaptation were identified in our transcriptomic dataset (Table S5). Among them, 46 putative flowering time unigenes were identified to be involved in drought escape (E-value,1e-5) which is a common costeffective strategy to avoid drought stress in natural populations     [41,42]. A total of 40 unigenes were found to be potentially involved in drought avoidance (E-value,1e-5), which regulate the biogenesis and development of the cuticle (eight unigenes), stomata (six unigenes), and trichomes and root system (26 unigenes). For drought tolerance strategy, the ABA-dependent and -independent pathways have been extensively studied. Totally, 32 unigenes were involved in ABA-dependent pathways, including biosynthetic and catabolic genes, ABA receptors, while eight unigenes were involved in ABA-independent pathway. These genes will be helpful for exploiting the genetic mechanism of how R. soongorica adapts to drought natural environment in northern China.

Discussion
In this study, a large amount of R. soongorica transcriptomic unigenes (77,647) were sequenced with Illumina HiSeq TM 2000 platform ( Table 1). The N50 length of unigenes was 1,109 bp and the average length was 677 bp; these results were comparable to the recently published plant trancriptomic analysis, such as Hevea brasiliensis (N50 = 436 bp, average length = 485 bp [43]) and Dendrocalamus latiflorus Munro (N50 = 1,132 bp, average length = 736 bp [44]). Up to date, Trinity is one of most powerful packages prevailed in de novo assembly of short reads, especially in dealing with alternative splicesomes and paralogs [33]. A total of 24,271 (31.26%) unigenes were generated as clusters which might correspond to putative alternative spliced transcripts and/or paralogous transcripts (''CL''-unigene). The number of clusters with only two types of ''CL''-unigenes (i.e. CL1.Contig1_A and CL1.Contig2_A) was notably high in our dataset (5,662 clusters, Figure S3). A considerable number of clusters with more than ten types of ''CL''-unigenes were also found in our dataset ( Figure S3). For example, thirteen ''CL2023.Contigs'' showed high identities (E-value,1e-41, Identity.78%) with Arabidopsis cryptochrome 1 (CRY1) ( Table S5, File S5), which plays a crucial role in sensing seasonal change of blue light and UV-A to initiate flower properly [45,46]. The thirteen contigs can be divided into three types based on sequence divergence with identities less than 90% (File S5). Together with the overrepresented numbers of the clusters with more than two types of ''CL''-unigenes, these results suggest that a large amount of paralogous transcripts were presented in our dataset, indicating at least a whole genomic duplicate might happen in the evolutionary history of R. soongorica. In addition, three alternative splicing sites were found in the alignment of cluster ''CL11.contigs'' which contained 32 contigs (File S6), showing that splice-isoforms were also produced by the Trinity assembly. Of course, imperfect assembly of short reads by the Trinity cannot be ruled out.
More than half of the unigenes (42,839, 55.17%) were successfully assigned as annotated genes by BLASTing against with public databases Nr, Nt, Swiss-Prot, GO, COG and KEGG, given the absence of genomic information of R. soongorica (Table 2). Notably, the percentage of annotated is the lowest in among the previous studies with the same sequencing strategy during the last year (58.24 to 78.9%, [44,47,48]). One possibility of lacking annotation is due to the technical limitation, such as sequencing depth and read length [49], which was common in the all studies with de novo transcriptome analysis [44,47,48]. We did find the unannotated sequences were averagely much shorter than the annotated unigenes (334 bp vs 975 bp, Figure 5A). However, there was still a considerable percentage of unigenes (7.96%, 2,770 of 34,808) with length over 500 bp and reads per kilobase per million reads (RPKM) over three failed to hit any homologs in the known plant species ( Figure 5). In addition, there were 173 unigenes contained at least one Pfam domain (E-value,1e-5, File S4), among which two reverse transcriptase domains RVT_3 and zf-RVT were highly represented with 26 and 19 times, respectively. These results suggested that a considerable portion of genes in R. soongorica might originate from novel retrotransposon mechanisms [50,51], which not found in any known genomes yet, and the high frequency of reverse transcriptase genes could be an indispensable part of R. soongorica genome (778 Mb, 2n = 22 [13]).
So far, V. vinifera was the highest related species with known genome to R. soongorica ( Figure 2C), but fewer than half of annotated unigenes in R. soongorica hit protein sequences in V. vinifera ( Figure 2C). This is consistent with that these two species were classified into different orders, Caryophyllales (Tamaraceae, Reaumuria) and Vitales (Vitaceae, Vitis) [52,53], and with their low bootstrap value support in the ITS phylogenetic tree ( Figure 2D). Therefore, the vast un-annotated unigenes (44.83%, 34,808 of 77,647) could only be novel genes compared with the known genomes and specific in the genome of the relict Tertiary plant R. soongorica. Functional and expressional studies by Digital Gene Expression analysis and real-time PCR are needed to further corroborate this hypothesis in detail.
After deep sequencing and exhaustive annotation, this endeavor provided a large amount of unigenes that will facilitate to exploit genetic resources in the functional studies and to identify candidate genes responsible for drought adaptation in R. soongorica. Candidate genes out of the 77,647 unigenes in R. soongorica involved in three drought adaptation strategies (drought avoidance, drought escape and drought tolerance; reviewed in [41,42]) were analyzed (Table S5). At least eight unigenes were possibly involved in the formation of the thick cuticle on R. soongorica leaf surface (8.3 mm [18]), which plays an important role in regulating the exchange of gases and water in plants and can enhance tolerance to drought [54]. To be mentioned, three unigenes were found as homologs of HvABCG31/Eibi1, an ATP-binding cassette subfamily G full transporter (also found in KEGG, ko02010), which is essential for the cutin formation and the preservation leaf water in wild barley [55]. Stomata, trichomes and root hairs are crucial for water usefulness and maintenance under drought environment (reviewed in [54,56]), and the molecular mechanisms of the differentiation of these tissues have been extensively studied in Arabidopsis [57][58][59]. FAMA is required for the terminal differentiation of guard cells [60]. Glabra 1 (GL1) is an important regulator of trichome initiation in Arabidopsis [61]. The development of root hairs and trichomes is regulated by GL2, GL3, Enhancer of Glabra 3 (EGL3) and Transparent Testa Glabra 1 (TTG1) with similar molecular mechanisms in Arabidopsis [62]. Here, a set of unigenes potentially involved in the biogenesis and development of these structures were identified (Table S5), which will facilitate to disentangle the formation of the specific traits such as hollow stomata and deep root system [18] and to further understand the molecular mechanism of drought avoidance strategy in desert plants.
Drought stress can promote plants flowering earlier [63][64][65]. In our field survey, flowering time of R. soongorica obviously varied between the populations along a precipitation gradient. The plants in Haishiwan, Gansu province (annual precipitation 600 mm) were still flowering in November, while the plants in Shashichang, Gansu province (precipitation 180 mm) had finished flowering and started to disperse seeds in September (data not shown). We propose that the genes control flowering time might have undergone strong natural selection in R. soongorica populations from extreme arid regions. Finally, including all the core components in circadian rhythm in plants KEGG pathway (Ko04712, File S1), 29 out of 51 flowering time genes in Arabidopsis can hit their homologous unigenes in R. soongorica (Table S5). Further investigating the correlation between the genetic diversity of these candidate genes and the variation of flowering time along drought gradient could shed light on how R. soongorica adapts to natural arid environment by drought escape strategy.
The molecular response to drought stress has been studied intensively in model plants, and at least two pathways are suggested to be involved into drought tolerance: ABA-dependent and -independent pathways (reviewed in [66,67]). In this study, most of the key genes related to ABA biosynthesis, catabolic and receptor complex were identified by KEGG annotation and local BLAST (File S3 and Table S5), such as rate-limiting enzyme 9-cisepoxycarotenoid dioxygenases (NCED) in the biosynthetic pathway [68], receptor complex components pyrabactin resistance/PYR1-like/regulatory components of ABA receptor (PYR/PYL/RCAR) [69], protein phosphatase 2Cs (PP2Cs) to sensor ABA signal [70], and important kinases and transcription factors (i.e. SNF1-related protein kinase 2s (SnRK2s) [71], ABRE-binding factor/ABA-responsive elements (ABF/ AREBs) [72]). All of these key unigenes may function in ABA signal transduction in response to drought stress as in model plants. Furthermore, several transcription factors like Dehydration Responsive Element Binding/C-repeat Binding Factors (DREB/CBFs) which are involved in ABA-independent pathways were identified by local BLASTing (Table S5). These genes will help to uncover the molecular basis of physiological responses to drought stress in R. soongorica.
The C 4 pathway has been acknowledged to be more adaptive than the C 3 pathway in response to abiotic stresses, such as high temperature, radiation and drought [73,74]. The C 3 and C 4 photosynthesis can occur simultaneously in different leaves within the same plant [75], and the transition between the C 3 and C 4 pathways can also occur in some C 3 plants under some conditions (i.e. Eleocharis vivipara [76], Flaveria brownii [77]). Nevertheless, the origin of C 4 pathway and the transition between the C 3 and C 4 pathways remained elusive because of the absence of genetic evidence. R. soongorica is a C 3 plant according to its physiological characteristics [32]. In this study, all of the genes encode key enzymes in C 4 carbon fixation pathway were presented in the transcriptomic dataset from the annotation of KEGG (File S2). The lengths of the C 4 genes were not statistically different with the C 3 genes (ANOVA p = 0.83), but the C 3 genes were significantly abundant than C 4 genes (p = 2.7e-05, Figure 6). This is partially concordant to the previous studies which characterized R.
soongorica as a C 3 plant [32,78,79]. To our knowledge, the present and active of the C 4 genes in R. soongorica might be the first transcriptomic evidence to support that a Tamaricaceae plant could also orchestrate the C 3 and C 4 pathways in response to environmental changes [76,80,81].

Conclusion and Perspectives
Desert plants have attracted more and more attentions, because they contain plenty of potential genetic resources to adapt to the extremely harsh conditions in their habitats. In the recent years, transcriptome sequencing became a most powerful and efficient approach to uncover genomic information in non-model organisms [82,83], few studies (but see [84]) focused on exploitation of the molecular basis of drought adaptation of desert plants. In present study, 77,647 unigenes were generated from a Tertiary relic species R. soongorica with the Illumina HiSeq TM 2000 platform and more than half of unigenes has been annotated. At least 123 candidate genes related to drought adaptation were identified by the KEGG annotation and local BLAST, and population genetic study on these candidate genes will help us to better understand the adaptive evolutionary mechanism of R. soongorica. Expression and function analysis of the un-annotated unigenes will be also employed to unravel the specific drought adaptation mechanism in R. soongorica. These endeavors will advance our knowledge of a dominant plant species coping with the global climate change in the fragile desert ecosystems.

Ethics Statement
R. soongorica is a species widely distributed in Shashichang, Jingtai County, Gansu province and other arid regions, and it is not included into any list of endangered or protected species. Before collecting the samples, an oral permission was got from the local management of forestry after applying with introduction letters of CAREERI (Cold and Arid Regions Environmental and Engineering Research Institute, Chinese Academy of Sciences).

Plant Materials
Leaves and inflorescences of R. soongorica ( Figure S4A) were collected in wild field, Shashichang, Jingtai County, Gansu province, northwest of China (37u219410N,104u89110E), where the average annual precipitation is 180 mm from 1950 to 2000 (http://www.worldclim.org/). Tissues were immediately frozen in liquid nitrogen for later extraction of total RNA. R. soongorica seeds from the same place were planted on damp filter papers and incubated at 4uC for 4 days before being placed at 23uC under long-day (16 h light/8 h dark) condition with a photosynthetic photon flux density of 150 mmol m 22 s 21 . Two weeks after germination, seedlings were harvested for RNA isolation ( Figure  S4B).

RNA Extraction, Library Preparation and RNA-seq
Total RNA was extracted with E.Z.N.AH Plant RNA Kit (Omega Bio-tek, Doraville, GA, USA). The concentration and quality of each RNA sample were determined by NanoDrop 2000 TM micro-volume spectrophotometer (Thermo Scientific, Waltham, MA, USA) and gel electrophoresis. One sample of total RNA was extracted from mature plant tissues including flowers and leaves. Another sample was from whole seedlings including roots, hypocotyls, and cotyledons. The two RNA samples were pooled equally to construct the cDNA library with a final concentration of 610.4 ng/ml.
The poly (A) mRNA was enriched by magnetic Oligo (dT) beads, and then was interrupted into 200-700 nt fragments. Using these short fragments as templates, the first cDNA strand was synthesized by random hexamers primers, followed by the secondstand cDNA synthesis using DNA polymerase I (New England BioLabs) and RNase H (Invitrogen). The short DNA fragments were purified with a QiaQuick PCR extraction kit (QIAGEN Inc., Valencia, CA, USA) for following end repairing and tailing A. Then the DNA fragments were ligated to sequencing adaptors, and the DNA fragments with required length were purified by agarose gel electrophoresis and gathered by PCR amplification.

De novo Assembly and Expression Profiling
The clean reads were obtained after filtering adapter sequences, reads with 5% ambiguous sequences 'N', and low-quality reads (reads with a base quality less than Q20), with a custom PERL script. Then, Trinity, a package consisting of Inchworm, Chrysalis and Butterfly, was used to perform the de novo assembly of highquality clean reads [33]. The command-line parameters used were ''-seqType fq -min_contig_length 200-CPU 5-bflyHeapSpace-Max 4G -JM 20G''. Short reads with overlapping sequences were firstly assembled by Inchworm to form the longest contigs without gaps. These contigs were pooled to build into de Brujin graphs by Chrysalis. According to the paired end information, Butterfly reconciled the de Bruijn graphs and output longer sequences without overlooking the possibility of splice forms and paralogous transcripts. Such sequences which cannot be extended on either end were defined as unigenes. After clustering, the unigenes were divided into clusters (prefix is ''CL'') and singletons (prefix is ''Unigene'').
RPKM of each unigene was normalized by ERANGE3.1 software to determine the unigene expression profiles [85].

Functional Annotation
All sequences were annotated by aligning with public protein and nucleotide databases, such as the NCBI Nr, Nt database, the Swiss-Prot protein database, the KEGG database, and the COG with an E-value cutoff of 1e-5. Based on the alignment results, the further annotation analysis with GO terms, which described biological processes, molecular functions and cellular components, was performed by Blast2GO software [35,86]. The distribution of the gene functions was plotted by WEGO [87].

Identification of Putative Candidate Genes Involved in Drought Adaptation
Generally, plants take three main strategies (drought escape, drought avoidance and drought tolerance) to adapt to drought conditions (reviewed in [41,42]). To uncover the potential candidate genes related to the three strategies in R. soongorica, genes in model plant Arabidopsis involved in flowering time (drought escape), epidermal development such as stomata, cuticle waxes, trichomes and root hairs (drought avoidance), and ABA and drought stress signals (drought tolerance), were selected to screen the potential orthologs from the unigene dataset by Local BLASTN with E-value cutoff of 1e-5 (ftp://ftp.ncbi.nlm.nih.gov/ blast/executables/LATEST-BLAST/).