Transcriptome Sequencing and Identification of Cold Tolerance Genes in Hardy Corylus Species (C. heterophylla Fisch) Floral Buds

Background The genus Corylus is an important woody species in Northeast China. Its products, hazelnuts, constitute one of the most important raw materials for the pastry and chocolate industry. However, limited genetic research has focused on Corylus because of the lack of genomic resources. The advent of high-throughput sequencing technologies provides a turning point for Corylus research. In the present study, we performed de novo transcriptome sequencing for the first time to produce a comprehensive database for the Corylus heterophylla Fisch floral buds. Results The C. heterophylla Fisch floral buds transcriptome was sequenced using the Illumina paired-end sequencing technology. We produced 28,930,890 raw reads and assembled them into 82,684 contigs. A total of 40,941 unigenes were identified, among which 30,549 were annotated in the NCBI Non-redundant (Nr) protein database and 18,581 were annotated in the Swiss-Prot database. Of these annotated unigenes, 25,311 and 10,514 unigenes were assigned to gene ontology (GO) categories and clusters of orthologous groups (COG), respectively. We could map 17,207 unigenes onto 128 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database. Additionally, based on the transcriptome, we constructed a candidate cold tolerance gene set of C. heterophylla Fisch floral buds. The expression patterns of selected genes during four stages of cold acclimation suggested that these genes might be involved in different cold responsive stages in C. heterophylla Fisch floral buds. Conclusion The transcriptome of C. heterophylla Fisch floral buds was deep sequenced, de novo assembled, and annotated, providing abundant data to better understand the C. heterophylla Fisch floral buds transcriptome. Candidate genes potentially involved in cold tolerance were identified, providing a material basis for future molecular mechanism analysis of C. heterophylla Fisch floral buds tolerant to cold stress.


Introduction
The genus Corylus consists of deciduous species which naturally occur in temperate forest areas in Europe, the Middle East, Asia, and North America [1]. Currently over 4,000,000 t of nuts are commercially produced through the world, of which 700,000 t are hazelnut production [2]. Hazelnuts are important woody species in Northeast China. Its cultural area is approximately one million hectares, ranking first in the world. As the most widely distribution Corylus plant in China, Corylus heterophylla Fisch yield accounts for more than 70% of the total output in the domestic market of China [3]. Hazelnuts, due to their organoleptic characteristics, constitute one of the most important raw materials for pastry and chocolate industry [2,4]. In addition to providing a desirable flavor to different food, they play an important role in human nutrition and health due to their protein, oil, vitamin, and mineral content. Hazelnuts are rich in both monounsaturated and polyunsaturated fatty acids, as well as in vitamin E [5]. Corylus species are also important sources of taxol (paclitaxel), which is an effective yet relatively expensive medicine for treatment of breast, ovarian, and lung cancer [6][7][8].
Cold stress is one of the most severe abiotic stresses and adversely affects plants by causing tissue injury and delayed growth [9,10]. Cold stress can be classified as chilling (,20uC) and freezing (,0uC) stress [11]. Not all plants are always ready to tolerate freezing temperatures. However, many plants are tolerant of freezing temperature after exposure to non-freezing low temperature, a phenomenon called cold acclimation (CA) [12,13]. Arabidopsis cold acclimate only about 5-7uC allowing for brief exposures to freezing temperatures whereas woody perennials can withstand extremely low subzero temperatures for extended periods of time. In addition, overwintering floral buds display both enhanced freezing tolerance and dormancy/relief of dormancy [14]. In such process, various physiological and biochemical changes occur in plant cells, which may confer subsequent acquired chilling and freezing tolerance to plants [15]. The signaling pathways used by plants in responding to cold stress and the key genes for modifying the response are of interest [16]. The best characterized regulon of cold-stress responses in plants contains transcription factor C-repeat binding factor dehydrationresponsive element-binding protein (CBF/DREB) and its coldinducible target genes, known as COR (cold-regulated gene), KIN (cold-induced gene), RD (responsive gene to dehydration), or LTI (low-temperature-induced gene) [17][18][19][20][21][22]. A large number of studies demonstrate that gene expression changes occur in a wide range of plant species in cold response [15], but the precise hierarchical organization of the global network has not been defined.
Hazelnuts will grow in a wide range of soil types from acidic mountain soils to basic soils derived from limestone. The plants grow best in mild, humid climate without extremes of heat or cold. However, buds, leaves, catkins, and female flowers are tolerant of frosts down to 27uC [23]. However, little is known about its tolerance mechanisms. During the last decades, large amounts of transcriptomic and genomic sequences have been available in many model organisms. For Corylus heterophylla, only 90 nucleotide sequences have been deposited in GeneBank database (as of the May 25, 2014). Therefore, extensive genomic or transcriptomic sequences are badly needed for Corylus heterophylla, which can be used for new genes discovery, gene localization, and comparative genomics and so on.
Transcriptome is the complete collection of transcripts in a cell at a specific developmental stage, which provides valuable and comprehensive information on gene expression, gene regulation and amino acid content of proteins [24]. The development of sequencing technology has provided a novel method for the analyses of transcriptome [25]. In plants, RNA-seq has accelerated the investigation of the complexity of gene transcription patterns, functional analyses and gene regulation networks [26]. In the present work, an RNA-seq project for C. heterophylla Fisch was initiated. Four C. heterophylla Fisch floral buds samples, including floral buds in non-cold acclimation (NA), cold acclimation (CA), midwinter (MW), and deacclimation (DA) stages were sequenced using the high-throughput Illumina deep sequencing technique. In addition, we estimated the expression profiles of key genes involved in cold acclimation. The transcriptome sequencing from C. heterophylla Fisch may help improve future genetic and genomic studies on the molecular mechanisms behind the cold tolerance of the C. heterophylla Fisch floral buds.

mRNA-seq and de novo transcriptome assembly
To obtain a global overview of the C. heterophylla Fisch floral buds transcriptome, a cDNA library was generated from an equal mixture of RNA isolated from floral buds in the four stages during winter (including NA, CA, MW, and DA), and paired-end sequenced using the Illumina platform. After stringent quality assessment and data filtering, 25,221,054 of 75-bp reads (,1.85 G) with 97.38% Q20 bases (those with a base quality greater than 20) were selected as high quality reads for further analysis. An overview of the sequencing is presented in Table 1. The high quality reads produced in this study have been deposited in the National Center for Biotechnology Information (NCBI) SRA database (accession number: SRX529300). Using the Trinity de novo assembly program [27], nextgeneration short read sequences were assembled into 82,684 contigs, with N50 length of 642 bp and with mean length of 325 bp ( Table 2). The distribution of contigs is shown in Fig. 1A. In total, there were 6,208 contigs coding for transcripts longer than 1 kb and 1,323 contigs coding for transcripts longer than 2 kb. The contigs were subjected to cluster and assembly analyses. A total of 40,941 unigenes were obtained, among which 9,323 genes (22.8%) were greater than 1 kb. The length distributions of unigenes are shown in Fig. 1B, revealing that more than 17,757 unigenes (43.4%) are greater than 500 bp. An overview of the assembled contigs and unigenes is presented in Table 2. These results demonstrated the effectiveness of Illumina pyrosequencing in rapidly capturing a large portion of the transcriptome. As expected for a randomly fragmented transcriptome, there was a positive relationship between the length of a given unigene and the number of reads assembled into it (Fig. 1C).

Transcriptome annotation
To determine protein-coding transcripts we screened the C. heterophylla Fisch floral buds transcriptome against the NCBI Non-redundant (Nr) peptide database and Swiss-Prot protein database using BLASTx with a cutoff E-value of 10 25 . Mapping of 30,549 (74.6%) of the unigenes to the Nr library suggests that most of the unigenes can be translated into proteins. Furthermore, 18,581 (45.4%) unigenes had significant matches in the Swiss-Prot database ( Table 3). Distribution analysis based on BLASTx searches showed that the unigenes of C. heterophylla Fisch have homologs in numerous hit a lot of plant species (Fig. 2). Among the various plant species that have protein sequence information in GenBank, the unigenes of C. heterophylla Fisch had the highest number of hits to sequences from Amygdalus persica (26.4%), followed by Vitis vinifera (24.6%), Ricinus communis (10.4%), Populus trichocarpa (9.5%), Fragaria vesca (7.6%), Glycine max (6.2%), and Cucumis sativu (5.1%) (Fig. 2). The high similarity of C. heterophylla Fisch unigenes to genes from Amygdalus persica and Vitis vinifera suggests the possibility of using the genome of Amygdalus persica or Vitis vinifera as a reference for identifying different gene expression patterns of mRNA-seq data.

Functional classification by GO
To assign functional information to transcripts, Gene Ontology (GO) analysis was carried out, which provides a dynamic, controlled vocabulary and hierarchical relationships for the representation of information on biological process, molecular function, and cellular component, allowing a coherent annotation of gene products. There were 30,549 unigenes annotated in Nr database, among which 25,311 unigenes were assigned with one or more GO terms, with 49.0% for biological process, 40.8% for molecular function, and 10.1% for cellular component ( Fig. 3 and Fig. 4A). For biological process, metabolic process (GO:0008152) were the most represented GO term, followed by cellular process (GO:0009987). Regarding molecular function, unigenes with binding activity (GO:0005488) and catalytic activity (GO:0003824) were highly represented. For cellular components, the most represented category was cell (GO:0005623) and cell part (GO:004464) (Fig. 3).
Hardy plants develop essential tolerance for cold survival through multiple levels of biochemical and cell biological changes. These responses are due to reprogramming of gene expression which results in the adjusted metabolic alternations. The first step in switching on such molecular responses is to perceive the stress as it occurs and to relay information about it through a signal transduction pathway [28]. To explorer the unigenes might be involved in signal transduction and stress responses, we then compared the GO terms related with signal transduction (
To further demonstrate the usefulness of C. heterophylla Fisch unigenes generated in the present study, we identified biochemical pathways represented by the unigene collection. Annotations of C. heterophylla Fisch unigenes were fed into the KEGG Pathway Tools, which is an alternative approach to categorize gene functions with the emphasis on biochemical pathways. This process predicted a total of 128 pathways represented by a total of 17,207 unigenes. Summary of the sequences involved in these pathways was included in Table S3. The top 3 pathways included Plant hormone signal transduction (756 unigenes), RNA transport (727 unigenes), and spliceosome (713 unigenes) (Fig. 6). As shown in Fig. 7, multiple C. heterophylla Fisch unigenes (red rectangle) were involved in the process of spliceosome assembly. Some made up the key components of spliceosome assembly including U1, U2, U4, U5 and U6 etc. Some other unigenes, such as Prp5, UAP56, Prp2, Prp16, Prp17, Prp18, Prp22, Slu7, and Prp43, directly participated in the process of spliceosome assembly. As C. heterophylla Fisch floral buds undergoing cold acclimation during winter and preparing for flower organs differentiation, this result showed that versatile alternative splicing events may occur in the C. heterophylla Fisch floral buds, which suggested that alternative splicing regulation is a general approach to affect complex plant biological processes including plant development, disease resistance and stress responses etc. Altogether, 31,844 (77.8%) unigenes were successfully annotated in the Nr, Swiss-Prot, COG, KEGG, and GO databases listed in Table S1.
Candidate genes involved in cold tolerance in C. heterophylla Fisch floral buds As we known, the response of plants to any environmental signal is mediated by a series of reactions, collectively referred to as signal transduction [28]. After perception of the cold signal, the downstream transcription factors and response genes were reprogrammed and then result in the adjusted metabolism. To identify the genes likely involved in cold tolerance in C. heterophylla Fisch floral buds, we construct a candidate cold tolerance gene set based on the GO terms representation. A large number of studies indicated that the homologs of selected candidate genes involved in plant cold tolerance ( Table 4). The selected unigene ID and annotation are listed in Table 4.
To validate the responsible of genes in the candidate cold tolerance gene set to cold, we then selected ten genes from the set and detected their expression pattern under cold stress and during winter by qRT-PCR. In woody perennials of the temperate zone, cold acclimation is triggered by several environmental cues, not only low temperatures, and is generally considered a two-step process: The first stage is induced by short photoperiod and the timing and speed of acclimation can be affected by other such as  [29]. To exclude the influence of other factors, we first detected the time course expression patterns of these genes under cold stress (4uC). As shown in Fig. 8A, in C. heterophylla Fisch leaves, NIR and ADF were rapidly induced at 2h after treated by cold stress, while bZIP78, CIP, GPAT, and COR413 were dramatically induced at 4h after treated by cold stress. On the whole, all the ten cold tolerance candidate genes were induced in different degree in C. heterophylla Fisch leaves when they were exposed to the cold stress. It suggests that these genes are responsible to cold stress. We then analyzed their expression pattern during overwintering in C. heterophylla Fisch floral buds. During winter, the C. heterophylla Fisch floral buds undergoing four stages: NA, CA, MW, and DA (see methods). These selected genes can be classified into three types according their expression patterns during the process of cold acclimation (Fig. 8B). (1) Type I: the expression of bZIP78, CIP, NIR, GPAT, Hsp22, COR413, and BAM were induced immediately in CA stage and were further induced in the following MW stage, indicating that these genes could response to early and later cold acclimation. (2) Type II: ERD7 had higher expression in CA stage than MW stage, indicating ERD7 might be mainly involved in early cold acclimation. (3) Type III: HD-ZIP and ADF were significant induced in MW stage, but were not induced in CA stage, implying their involvement in later response to cold stress. In type I class, the mRNA level of membrane protein (COR413) and transcription factors (bZIP78) were immediately induced when the C. heterophylla Fisch floral buds into the CA stage and were further induced in MW stage. The cold-regulated (COR)413plasma membrane and COR413-thylakoid membrane groups are potentially targeted to the plasma membrane and thylakoid membrane, respectively. It is known that the plasma membrane is the primary site of freezing injury [30]. As an integral membrane protein, COR413-PM could play a structural role by stabilizing the plasma membrane lipid bilayer [31]. Many studies have shown that bZIP transcription factor play an important role in the ability of plants to withstand various stresses. In soybean plants, GmbZIP44, GmbZIP62, and GmbZIP78 can bind to GLM (GTGAGTCAT), ABRE (CCACGTGG), and PB-like (TGAAAA) elements with differential affinity. They may function in ABA signaling through upregulation of ABI1 and ABI2 and play roles in salt and freezing tolerance through regulation of various stress responsive genes [32]. The early response of these transcription factors could enhance the expression of a series downstream stress responsive genes rapidly and then enhance the cold tolerance of C. heterophylla Fisch floral buds. Type II class gene ERD7 mainly induced in CA stage but not in MW stage. Early responsive to dehydration (ERD) genes are defined as genes that are rapidly activated during drought stress. ERD15 from Arabidopsis has been functionally characterized as a common regulator of the abscisic acid (ABA) response and the salicylic acid (SA)-dependent defense pathway [33]. The intense induction of ERD7 in CA stage implying the hormone signal involved in early cold acclimation in C. heterophylla Fisch floral buds. In type III class, the actin depolymerizing factors (ADF) are part of the ADF/cofilin group, a family of small proteins (15-22 kD) that includes cofilin, destrin, depactin, and actophorin. The members of this family can be described as stimulus-responsive modulators of the cell actin cytoskeleton dynamics. Using Arabidopsis ADF1, Carlier et al. [34] have suggested that one of the main functions of ADF is to increase the turnover rate of actin filaments. In wheat, the induction of an active ADF during cold acclimation and the correlation with an increased freezing tolerance suggest that the protein may be required for the cytoskeletal rearrangements that may occur upon low temperature exposure [35]. Moreover, the kinetics of TaADF protein expression is different from the mRNA expression pattern [36]. During the acclimation period, TaADF mRNA level is maximal after 1 or 2 days and slowly decreases afterward, whereas protein accumulation increases and peaks at 49 days in the hardy cultivars. In this study, the expression of ADF was induced in MW stage in C. heterophylla Fisch floral buds,

Conclusion
In this study, de novo transcriptome sequencing of the C. heterophylla Fisch floral buds using Illumina platform was performed for the first time. 28,930,890 raw reads were de novo assembled into 40,941 unigenes. All unigenes were then evaluated and functionally annotated by comparing with the existing protein databases, such as NCBI Nr database, Swiss-Prot database, COG database, and KEGG database. A large number of candidate genes potentially involved in growth, development, and stress tolerance were identified, and are worthy of further investigation. To our knowledge, this is the first application of Illumina pairedend sequencing technology to investigate the transcriptome of C. heterophylla Fisch floral buds and moreover the assembly of the reads was conducted without reference genome information. The database will improve our understanding of the molecular mechanism of cold tolerance in C. heterophylla Fisch floral buds. This resource should lay an important foundation for future genetic or genomic studies on Corylus genus.   For the short-term cold treatment, the seedlings were grown in a growth chamber with 25uC/21uC (day/night), 16 h light/8 h dark cycle. For the exposure to cold stress, four-weeks-old seedlings were transferred to a cold chamber at 4uC. The leaves were collected at 0, 2, 4, 8, and 24 h of cold stress. All of the experiments were repeated at least three times.
Total RNA was extracted from floral buds using the RNeasy Plant Mini kit (Qiagen, Valencia, CA, USA). DNA contamination was removed with RNase-free DNase I (Qiagen). RNA was concentrated and purified with an RNA MinElute kit (Qiagen). RNA quality and quantity were assessed by absorption at 260 nm/280 nm, gel electrophoresis, and via the Agilent 2100 Bioanalyzer (Agilent Technologies, USA).

mRNA-seq library construction for illumine sequencing
The mRNA-seq library was constructed following the manufacturer's instructions of mRNA-seq Sample Preparation Kit (Cat# RS-930-1001, Illumina Inc, San Diego, CA) (Illumina). Briefly, mRNA was purified from 20 mg of total RNA using oligo (dT) magnetic beads. Following purification, the mRNA is fragmented into small pieces using divalent cations under elevated temperature. Taking these short fragments as templates, firststrand cDNA was synthesized using reverse transcriptase and     Sequence data analysis and de novo assembly The raw reads were cleaned by removing adaptor sequences, empty reads and low quality sequences, which included the reads with N percentage (i.e., the percentage of nucleotides in read which could not be sequenced) over 10% and ones containing more than 50% nucleotides in read with Q-value#5. Transcriptome de novo assembly was performed separately with the short reads assembling programs SOAPdenovo [37] and Trinity [27]. It has been demonstrated that Trinity is a more efficient de novo transcriptome assembler, especially in the absence of a reference genome [27]. First, Trinity combined the reads with a certain overlap length to form longer fragments, which were called contigs. Next, these reads were mapped back to contigs; with paired-end reads, Trinity was able to detect contigs from the same transcript and determine the distances between these contigs. Finally, Trinity connected these contigs into sequences that could not be extended on their end. Such sequences were defined as unigenes.
As the Trinity assembler discards low coverage k-mers, no quality trimming of the reads was performed prior to the assembly. Trinity was run on the paired-end sequences with the fixed k-mer size of 25, minimum contig length of 100.

Gene annotation and classifications
The optimal assembly results were chosen according to the assembly evaluation. The assembled sequences were compared against the NCBI Nr database and Swiss-Prot database using BLASTn with an E-value of 10 25 . Gene names were assigned to each assembled sequence based on the best BLAST hit (highest score). To increase computational speed, such search was limited to the first 10 significant hits for each query. The ORFs were identified as the nucleotide sequence or as the protein translation provided by the ''GetORF'' program from the EMBOSS software package [38]. The longest ORF was extracted for each unigene. We quantified transcript levels in Reads Per Kilobase of exon model per Million mapped reads (RPKM) [39]. The RPKM measure of read density reflects the molar concentration of a transcript in the starting sample by normalizing for RNA length and for the total read number in the measurement. Genes with high expression levels were screened and listed.
To annotate the assembled sequences with GO terms describing biological processes, molecular functions and cellular components, the Swiss-Prot BLAST results were imported into Blast2GO [40], a software package that retrieves GO terms, allowing gene functions to be determined and compared. These GO terms are assigned to query sequences, producing a broad overview of groups of genes catalogued in the transcriptome for each of three ontology vocabularies, biological processes (BP), molecular functions (MF) and cellular components (CC). The unigenes sequences were also aligned to the COG database to predict and classify functions. KEGG pathways were assigned to the assembled sequences using the online KEGG web server (http://www. genome.jp/kegg/) [41]. The output of KEGG analysis includes KO assignments and KEGG pathways that are populated with the KO assignments.

Quantitative RT-PCR
Total RNA was isolated from C. heterophylla Fisch floral buds during four stages (NA, Non-cold Acclimation; CA, Cold Acclimation; MW, Midwinter; DA, Deacclimation) with the RNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA). cDNA synthesis was performed with 1 mg total RNA using the Superscript III First Strand Synthesis system followed by the RNase H step (Invitrogen, Carlsbad, USA), according to the manufacturer's protocol. Primer pairs were designed using Primer3 (http://frodo.wi.mit.edu/ primer3/) with the following parameters: Tm of approximately 60uC, product size range of 100-260 base pairs, primer sequences with a length of approximately 20 nucleotides, and a GC content of 45-55%. The gene names and primers used for qRT-PCR are liste in Table S4. To quantify the expression level the selected genes, the C. heterophylla Fisch Actin was used as an internal control. qRT-PCR was performed using a 7500 Real-time PCR System (Applied Biosystems, CA, USA) and a SYBR Premix Ex Taq Kit (TaKaRa, Dalian, China). The relative quantitative method (DDC T ) was used to calculate the fold change of the target genes [42]. Three biological replicates (nested with three technical replicates) per sample were carried out.