Genome-wide mining seed-specific candidate genes from peanut for promoter cloning

Peanut seeds are ideal bioreactors for the production of foreign recombinant proteins and/or nutrient metabolites. Seed-Specific Promoters (SSPs) are important molecular tools for bioreactor research. However, few SSPs have been characterized in peanut seeds. The mining of Seed-Specific Candidate Genes (SSCGs) is a prerequisite for promoter cloning. Here, we described an approach for the genome-wide mining of SSCGs via comparative gene expression between seed and nonseed tissues. Three hundred thirty-seven SSCGs were ultimately identified, and the top 108 SSCGs were characterized. Gene Ontology (GO) analysis revealed that some SSCGs were involved in seed development, allergens, seed storage and fatty acid metabolism. RY REPEAT and GCN4 motifs, which are commonly found in SSPs, were dispersed throughout most of the promoters of SSCGs. Expression pattern analysis revealed that all 108 SSCGs were expressed specifically or preferentially in the seed. These results indicated that the promoters of the 108 SSCGs may perform functions in a seed-specific and/or seed-preferential manner. Moreover, a novel SSP was cloned and characterized from a paralogous gene of SSCG29 from cultivated peanut. Together with the previously characterized SSP of the SSCG5 paralogous gene in cultivated peanut, these results implied that the method for SSCG identification in this study was feasible and accurate. The SSCGs identified in this work could be widely applied to SSP cloning by other researchers. Additionally, this study identified a low-cost, high-throughput approach for exploring tissue-specific genes in other crop species.


Introduction
Peanut (Arachis hypogaea L., which is also referred to as groundnut) is one of the most important oil crop species worldwide and plays important roles in human nutrition [1]. Peanut seeds, which are rich in oleic acid, linoleic acid, proteins and other nutrients, are ideal bioreactors for the production of foreign recombinant proteins or other beneficial metabolites.
As important molecular tools, promoters are usually used in gene functional analysis [2][3][4] and are also widely used for plant quality improvement [5][6][7][8]. Seed-specific promoters (SSPs), which can drive the expression of foreign genes specifically in seeds, are of great importance for genetic engineering of seeds. SSPs have been widely applied in plant molecular pharming, such as that involving golden rice [8], purple endosperm rice [9], purple embryo maize [7] and fish oil canola [10]. The use of SSPs can avoid constitutive expression, which can harm plants [11][12][13]. Moreover, repetitive use of the same promoter when expressing multiple foreign proteins simultaneously is considered inadvisable owing to the likelihood of transcriptional silencing [14][15][16]. Therefore, additional peanut SSPs are needed to overexpress or knock down specific genes, regulate seed development, and modify seed content, especially to produce foreign recombinant proteins or secondary metabolites.
To date, few SSPs from peanut are available, and those that are available were identified from known genes expressed specifically in the seed [17][18][19]. Tissue-specific gene expression provides fundamental information for SSP mining. Several methods have been developed to analyze gene expression differences, such as subtractive hybridization [20], suppression subtractive hybridization [21], differential display reverse transcription PCR [22], and cDNA microarrays [23,24]. However, these methods are limited by their specific shortcomings; for example, only known genes can be recognized by microarray chips [23]. With the decreasing cost of transcriptome sequencing, comparative transcriptome sequencing has been widely used to analyze differences in gene expression [25][26][27][28]. The diploid peanut ancestors Arachis duranensis (AA) and Arachis ipaensis (BB) are considered the donors of the A and B subgenomes of the allotetraploid cultivated peanut Arachis hypogaea [1]. The release of A. duranensis and A. ipaensis genome sequences [1] made it convenient to obtain genetic information from cultivated peanut. Comparative transcriptome sequencing combined with peanut genome information is a powerful means of genome-wide mining of SSCGs for promoter cloning.
In this study, we described a genome-wide comparative transcriptome sequencing-based approach to identify SSCGs for SSP cloning in peanut. A total of 337 SSCGs were identified from peanut, and the top 108 SSCGs according to their Fragments Per Kilobase of transcript per Million mapped reads (FPKMs) were characterized. On the basis of semiquantitative RT-PCR analysis, 94 SSCGs were expressed in a seed-specific manner, and 14 SSCGs were expressed in a seed-preferential manner. One novel SSP was cloned and characterized to verify its seed specificity in transgenic Arabidopsis. Our results could be widely used in the identification of future peanut SSPs.

Plant materials and RNA extraction
Plants of the cultivated peanut 'Shitouqi' were grown at the Laixi experimental station of the Shandong Peanut Research Institute during the summer of 2016. Leaves, roots, stems, pegs and pod shells were collected at the pod-maturing stage. Developing seeds were collected between 20 and 80 days after flowering. All tissues were flash frozen in liquid nitrogen and then stored at -80˚C for transcriptome sequencing.
Total RNA was isolated from the different tissues using TRIzol reagent (Life Technologies, Carlsbad, CA, USA). The quality and quantity of each RNA sample were assayed using a NanoDrop device (Thermo Fisher, MA, USA).

Illumina sequencing and in silico analysis
The RNA extracted from seeds at different development stages was mixed together as Sample I (seed), while the RNA from the leaves, roots, stems, pegs and pod shells was pooled in equimolar amounts as Sample II (nonseed). Both samples were treated and sequenced using an Illumina HiSeq™ 2500 instrument at Gene Denovo Biotechnology Company (Guangzhou, China). Reads containing adaptor sequences were cleaned, and low-quality reads were filtered out. The remaining reads of each sample were then mapped to the A. duranensis and A. ipaensis reference genomes [1] with TopHat2 [29].
The gene expression levels were normalized using the FPKM method. To mine SSCGs, the FPKM value of each transcript in Sample I was divided by its value in Sample II using Excel software. Transcripts with FPKM values less than 10 in Sample I or greater than 10 in Sample II were excluded, and the remaining transcripts with ratio values greater than 50 were considered SSCGs. The SSCGs were subsequently ranked according to their FPKM values.
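The screening step can be sketched in a few lines of Python. This is a minimal illustration rather than the spreadsheet procedure used in the study. It assumes the thresholds are read as an FPKM of at least 10 in Sample I, an FPKM of at most 10 in Sample II, and a Sample I/Sample II ratio above 50. The function name `screen_sscgs`, the `pseudocount` that guards against division by zero, and the gene names and FPKM values in `toy` are all hypothetical.

```python
from typing import Dict, List, Tuple


def screen_sscgs(
    fpkm: Dict[str, Tuple[float, float]],
    min_seed: float = 10.0,
    max_nonseed: float = 10.0,
    min_ratio: float = 50.0,
    pseudocount: float = 0.01,  # assumption: floor for the denominator, not from the source
) -> List[Tuple[str, float, float]]:
    """Return (gene, seed FPKM, seed/nonseed ratio), sorted by seed FPKM, highest first."""
    hits = []
    for gene, (seed, nonseed) in fpkm.items():
        # Discard genes that are weak in seed or clearly expressed in nonseed tissue.
        if seed < min_seed or nonseed > max_nonseed:
            continue
        # Floor the denominator so a nonseed FPKM of 0 does not raise ZeroDivisionError.
        ratio = seed / max(nonseed, pseudocount)
        if ratio > min_ratio:
            hits.append((gene, seed, ratio))
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits


# Hypothetical input: gene -> (FPKM in Sample I, FPKM in Sample II)
toy = {
    "geneA": (5200.0, 3.1),
    "geneB": (800.0, 0.0),
    "geneC": (40.0, 2.0),     # ratio of 20 is too low
    "geneD": (9000.0, 400.0), # too abundant in nonseed tissue
    "geneE": (6.0, 0.0),      # too weak in seed
}
candidates = screen_sscgs(toy)
for rank, (gene, seed, ratio) in enumerate(candidates, start=1):
    print(f"SSCG{rank}\t{gene}\t{seed:.1f}\t{ratio:.1f}")
```

With these toy values only `geneA` and `geneB` pass, and they are numbered in order of decreasing seed FPKM, mirroring how SSCG1 to SSCG337 are designated.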

Phylogenetic analysis
To study the phylogenetic relationships of the selected SSCGs, multiple alignments of their DNA sequences were performed using ClustalW. Unrooted phylogenetic trees were constructed by the neighbor-joining (NJ) method using MEGA 6.0 software, and the bootstrap test was carried out with 1000 replicates.

Expression analysis of SSCGs in A. duranensis and A. ipaensis
The FPKM data of the 108 selected SSCGs within 20 distinct tissues were retrieved from the work of Clevenger et al. [31]. The FPKM-normalized read count data of the SSCGs were log2-transformed and displayed in the form of heat maps via HemI [32].
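The log2 transformation applied before plotting can be illustrated as follows. This sketch assumes that a pseudocount of 1 is added so that FPKM values of zero remain defined; the text does not state which offset, if any, was used. The function name `log2_transform` and the toy values are hypothetical.

```python
import math
from typing import Dict, List


def log2_transform(
    fpkm_by_gene: Dict[str, List[float]],
    pseudocount: float = 1.0,  # assumption: keeps log2 defined when FPKM is 0
) -> Dict[str, List[float]]:
    """Return log2(FPKM + pseudocount) for every gene and tissue."""
    return {
        gene: [math.log2(value + pseudocount) for value in values]
        for gene, values in fpkm_by_gene.items()
    }


# Hypothetical FPKM values for two genes across three tissues.
toy = {"geneA": [0.0, 3.0, 1023.0], "geneB": [7.0, 0.0, 0.0]}
matrix = log2_transform(toy)
print(matrix)
```

The transformation compresses the wide dynamic range of FPKM values, so that highly abundant storage protein transcripts do not dominate the color scale of the heat map.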

Semiquantitative RT-PCR analysis in cultivated peanut
To further confirm the tissue expression specificity in cultivated peanut, RNA was extracted from leaves, roots, stems, pegs, pod shells and seeds collected at the pod-maturing stage. Three independent RNA preparations were used for semiquantitative RT-PCR. Twenty-six amplification cycles were used to evaluate and quantify the differences among transcript levels. RT-PCR was performed using the peanut Actin gene as an internal control [33]. PCR was performed using 2× Easy Taq PCR SuperMix (TransGen Biotech, Beijing, China). The PCR conditions were as follows: one initial denaturation step of 94˚C for 3 min; 26 cycles of 94˚C for 30 s, 58˚C for 30 s and 72˚C for 30 s; and one final extension step of 72˚C for 10 min. The primers used for these experiments are listed in S3 Table.

Isolation of an SSP
Peanut genomic DNA was isolated from young leaves of the 'Shitouqi' cultivar using a DNAquick Plant System Kit (Tiangen, Beijing, China). Using AHSSP29-specific primers (S3 Table), we performed PCR with PrimeSTAR GXL DNA Polymerase (Takara, Dalian, China). The PCR products were separated by electrophoresis through a 1.5% agarose gel and purified using a gel extraction kit (TransGen Biotech, Beijing, China). All purified PCR products were subcloned into a pEASY-blunt simple vector (TransGen Biotech, Beijing, China). The inserts were sequenced by the Shanghai Sangon Biotechnology Company (Shanghai, China).
The promoter fragment AHSSP29 of SSCG29 was excised from the pEASY-blunt simple vector with the restriction enzymes HindIII and BamHI (Thermo Fisher, MA, USA) and ligated into the corresponding restriction sites of the plant transformation vector pBI121 to produce an AHSSP29::β-glucuronidase (GUS) construct.

Generation of transgenic Arabidopsis plants
The recombinant binary plasmid was transferred to Agrobacterium tumefaciens strain GV3101, and kanamycin-resistant colonies were selected on medium containing 50 μg ml⁻¹ kanamycin. A selected colony was grown to stationary phase at 28˚C, and the cells were concentrated by centrifugation and then resuspended in a dipping solution that comprised 5% sucrose, 0.03% Silwet-77, and 10 mM MgCl₂ [34]. The seeds were harvested and subsequently stored at room temperature. For screening, the seeds were sterilized in 75% (v/v) ethanol for 3 min and then 2.6% NaClO for 10 min, followed by several washes with sterile water. The transformants were screened on one-half-strength Murashige and Skoog (MS) medium that contained 50 μg ml⁻¹ kanamycin.

Transgene detection in the transgenic progeny of Arabidopsis and GUS histochemical staining
Kanamycin-resistant transgenic Arabidopsis plants were identified using GUS gene-specific primers (S3 Table). The positive transgenic plants were then selfed, after which homozygous T2 progeny were obtained.

Genome-wide mining of SSCGs via comparative transcriptome sequencing
To mine SSCGs, two samples of the cultivated peanut 'Shitouqi' (Sample I for seed samples and Sample II for nonseed samples) were used for transcriptome sequencing via an Illumina HiSeq™ 2500 system. Approximately 10 Gb of sequence data (approximately 76.79 million reads from Sample I and 78.93 million reads from Sample II, each 300 bp in length) were obtained; after filtering the adaptor sequences and low-quality reads, approximately 75.37 and 77.81 million reads, respectively, were used for transcriptome assembly (S1 Table). All of the reconstructed genes were aligned to the reference genomes of A. duranensis and A. ipaensis [1] and were subsequently annotated. A comparative transcript profile was established based on the FPKM values of the assembled transcripts. Three hundred thirty-seven SSCGs were ultimately identified and designated sequentially as SSCG1 to SSCG337 according to their FPKM values. The detailed information of these SSCGs, including their gene symbol, chromosomal location, FPKM value and putative function(s), is listed in Table 1 and S2 Table. GO annotation was performed using BLAST2GO, and the 337 SSCGs were categorized with particular GO annotations (S1 Fig, Table 2). As expected, these SSCGs were enriched in the metabolic process (120) and catalytic activity (108) GO terms, which suggested the presence of vigorous metabolic activity in the seed, in which fatty acids such as oleic acid are converted into linoleic acid by fatty acid desaturase [36]. To identify promoters that are strongly or specifically expressed in the seed, the 108 most abundant SSCGs were chosen for further analysis. With the decreasing cost of transcriptome sequencing and the release of the peanut ancestor genomes, comparative transcriptome sequencing has become an efficient approach for mining tissue-specific genes from peanut and other less studied crop species.

Characterization of the top 108 SSCGs from A. duranensis and A. ipaensis
SSPs are usually isolated from genes that encode seed storage proteins and/or other proteins related to seed development, such as the Brassica napus Napin promoter, which was isolated from a 2S storage protein gene [37], indicating that the characteristics of a gene may reflect the specificity of its promoter. To predict the activity of their promoters, we therefore characterized the 108 SSCGs. Among the top 108 SSCGs, 96 had putative functions, and 12 had unknown functions. The 96 SSCGs were classified into 14 groups according to their annotations, and 54 of those SSCGs were involved in lipid metabolism and seed maturation or coded for nutrient reservoir proteins, allergens, and seed storage proteins (Fig 1C), which revealed that these top 108 SSCGs might perform functions within peanut seeds. As shown in Fig 1A, SSCGs were randomly dispersed across 10 chromosomes. In A. duranensis, chromosome A6 contained the greatest number of SSCGs (15), while chromosome A4 contained the fewest SSCGs (1). In A. ipaensis, 13 SSCGs were distributed on chromosome B6, whereas only 3 SSCGs were found on chromosomes B1 and B3 (Fig 1B). Several SSCGs were located on the chromosomes in clusters; for example, 6 SSCGs (SSCG2, SSCG3, SSCG7, SSCG36, SSCG42, SSCG75) were within the 1.26-1.8 cM region on chromosome A6 (Fig 1A); functional prediction revealed that these SSCGs encoded nutrient reservoir proteins (Table 1). SSCG14 and SSCG23, both of which coded for seed linoleate 9S-lipoxygenase, were located at the same locus of chromosome B8. These results suggested that these clustered genes might function in coordination.
In this study, we identified 39 orthologous gene pairs between A. duranensis and A. ipaensis based on phylogenetic relationships (S2 Fig, Table 3), among which 36 orthologous gene pairs were found at syntenic loci on the A. duranensis and A. ipaensis chromosomes (Fig 1A, Table 3). The orthologous genes from A. duranensis and A. ipaensis exhibited similar functions; for example, both SSCG63 (A9) and SSCG103 (B9) encode the AWPM-19-like family protein, and both SSCG87 (B9) and SSCG100 (A9) encode the papain family cysteine protease (Tables 1 and 3). Although the gene sequences of some orthologous pairs were highly similar, their promoter sequences were sometimes quite different. For example, SSCG43 (Araip.213GN) and SSCG94 (Aradu.440M4) had the same gene sequence but markedly different promoter sequences. Whether the promoters of orthologous gene pairs display the same specificity remains to be determined. The locations of 2 SSCGs in the A genome (SSCG21 and SSCG93) did not correspond to the locations of their orthologous genes in the B genome (SSCG12 and SSCG89). Interestingly, SSCG53, located on chromosome B7, had the same sequence as its orthologous gene, SSCG54, on chromosome B10.

Expression patterns of the top 108 SSCGs
To confirm the tissue expression specificity of the top 108 SSCGs, we first analyzed the expression profiles using the expression information provided by Clevenger et al. [31]. The heat map results showed that all the top 108 genes were expressed in the seed; most were expressed only in the seed, whereas the rest were preferentially expressed in the seed (Fig 2). The expression patterns of the orthologous genes from the A and B genomes were similar. For example, SSCG12 and SSCG21 were highly expressed during the Pt6, Pt7, Pt8 and Pt10 seed stages but weakly expressed in other tissues, such as mainstem leaves, the reproductive shoot tip, nodule roots, stamens and the aerial gynophore tip. SSCG78 was expressed in the early seed development stage (SeedPt5-7), while SSCG106 was expressed in the late seed development stage (SeedPt7, 8, 10). Their promoters could be used to express genes at different seed development stages. Notably, SSCG1-12 were expressed at extremely high levels in the seeds, and specifically, SSCG1 and SSCG6 were abundantly expressed during all five seed development stages (Fig 2). Functional prediction analysis revealed that these SSCGs encoded nutrient reservoir proteins or allergen proteins (Table 1), whose transcripts are considered to be expressed specifically in mature peanut seed [38,39]. We further examined the tissue expression specificity of the SSCGs in cultivated peanut via semiquantitative RT-PCR. Because the orthologous gene pairs had similar sequences, each pair was treated as a single gene, and primers were designed based on their shared sequence to investigate their expression patterns. As shown in Fig 3, similar to the heat map results, most of these 108 SSCGs were expressed specifically and/or preferentially in the seed. Ninety-four of the 108 SSCGs (87%) were expressed exclusively in the seed. Only a few SSCGs (SSCG13, 25, 41, 44, 51, 52, 58, 70, 75, 83, 84, 86, 88, 98) were also weakly expressed in other tissues, such as the roots, stems, pegs, pod shells and leaves. Overall, based on the expression pattern analysis above, the SSCGs described in this study are potential resources for seed-specific and/or seed-preferential promoter cloning.

Cis-acting elements in the promoter regions of the top 108 SSCGs
Gene expression specificity is mediated by cis-elements in the promoter region [40,41]. To identify the regulatory cis-elements in the promoter regions of the SSCGs, we extracted the 2500 bp promoter sequence upstream of the start codon of each of the top 108 SSCGs. The results showed that there were 92 promoters containing RY REPEAT motifs and 33 promoters containing GCN4 motifs. Thirty-seven promoters contained more than three RY REPEAT motifs, and there were five motifs in SSCG28 (Aradu.DWL7L) and SSCG99 (Aradu.UJ6Z9) and six in SSCG74 (Aradu.9S6MI). Twenty-nine promoter sequences contained both motifs (Table 4). The RY REPEAT (CATGCA) [42] and GCN4 (TGAGTCA) [43,44] motifs are commonly located within seed- and/or embryo-specific promoter sequences. These results implied that most of the promoters of the top 108 SSCGs were seed specific.
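A motif scan of this kind can be sketched as a simple string search. The example below is an illustration only: the text does not name the tool used for cis-element prediction, and counting hits on both strands is an assumption made here. The function names and the toy promoter sequence are hypothetical; only the two motif sequences come from the text.

```python
from typing import Dict

# Core motif sequences as given in the text.
MOTIFS = {"RY REPEAT": "CATGCA", "GCN4": "TGAGTCA"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_occurrences(seq: str, motif: str) -> int:
    """Count occurrences of motif in seq, including overlapping ones."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def scan_promoter(seq: str) -> Dict[str, int]:
    """Count each motif on the given strand and on the reverse strand (assumption)."""
    seq = seq.upper()
    result = {}
    for name, motif in MOTIFS.items():
        hits = count_occurrences(seq, motif)
        rc = reverse_complement(motif)
        if rc != motif:  # avoid double counting a palindromic motif
            hits += count_occurrences(seq, rc)
        result[name] = hits
    return result


# Hypothetical 31 bp fragment; a real scan would use the 2500 bp upstream region.
toy_promoter = "ttgaCATGCAaggTGAGTCAcctTGCATGaa"
print(scan_promoter(toy_promoter))
```

In the toy fragment the RY REPEAT is found once on each strand and the GCN4 motif once on the forward strand. Applying the same function to each 2500 bp upstream sequence would yield per-promoter counts comparable in form to those summarized in Table 4, although a dedicated cis-element database may apply different matching rules.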

Characterization of an SSP
To verify promoter tissue specificity, we isolated a 2771 bp promoter fragment (Arachis Hypogaea Seed-Specific Promoter 29, AHSSP29) from the cultivated peanut cultivar 'Shitouqi' according to the reference sequence of SSCG29 (Aradu.YC8MH) in its ancestor A. duranensis. SSCG29 encodes a vicilin-like seed storage protein. Several cis-acting elements, including one GCN4 motif [43,44], two RY REPEATs [42], and three 2SSEEDPROTBANAPAs [45], which commonly exist in SSPs, were detected in the AHSSP29 sequence (Table 5). The CaMV 35S promoter in the pBI121 vector was then replaced with AHSSP29 to produce an AHSSP29::GUS construct, which was subsequently transformed into Arabidopsis. GUS histochemical assays revealed GUS staining in all parts of the seed (Fig 4A-4C), with the exception of the seed testa. GUS staining was hard to observe in seed wrapped in a testa (Fig 4A), while GUS activity was clearly visible in the germinating seed that lacked a testa (Fig 4B and 4C). Definitive staining was also observed in the cotyledons and hypocotyls of the seedlings (Fig 4D), which are components of the seed. No GUS activity was detected in the leaves, stems, flowers, roots and siliques at any time during the plant life cycle (Fig 4E-4H). Nontransformed Arabidopsis plants did not display GUS activity in their mature seeds or any other parts of the plants. These results suggested that the AHSSP29 promoter was an SSP.

Discussion
SSPs are valuable tools for the genetic engineering of seed, especially for seed bioreactor research. Peanut seeds are ideal bioreactors for the production of foreign recombinant proteins and other nutrient metabolites. However, only a few seed-specific and/or seed-preferential promoters have been identified from peanut [17][18][19]46]. Expressing multiple foreign genes using the same promoter is ill advised [14][15][16]. Therefore, additional SSPs are urgently needed. In this study, we established an effective method for the genome-scale mining of SSCGs via comparative transcriptome sequencing of a mixture of nonseed tissue and seed tissue. A total of 337 SSCGs were identified, and 108 SSCGs in A. duranensis and A. ipaensis were further characterized. At least 94 SSCGs were confirmed via semiquantitative RT-PCR to be expressed specifically in the seed in cultivated peanut, and the rest were preferentially expressed in the seed. This study provided a valuable resource for seed-specific and/or seed-preferential promoter cloning.
Among the 108 identified SSCGs, most functioned in relation to seed development or coded for allergen proteins or storage proteins (Fig 1C, Table 1). For example, SSCG1-7 and SSCG9, which encoded allergen proteins, were homologous genes and were expressed at extremely high levels according to their FPKM values (Table 1), heat map results (Fig 2) and semiquantitative RT-PCR analysis (Fig 3). Peanut allergen proteins were reported to be expressed exclusively in the seed [39] and accounted for a considerable amount of the total seed protein in peanut [47]. This finding is in accordance with the abundant expression of SSCG1-7 and SSCG9 in the peanut seed. These results indicated that these SSCGs were expressed specifically in the seed, and the most abundantly expressed SSCGs were the focus of our subsequent promoter cloning. Studies have shown that several cis-acting elements in promoter sequences are responsible for mediating gene expression specificity. For example, the cis-acting elements RY REPEAT and GCN4 are conserved among many SSPs [42,43]. These cis-acting elements were also present in the promoters of most of the SSCGs in this study, which implied that the promoters of most of the SSCGs might drive gene expression in a seed-specific manner. Several promoters of these SSCGs have been characterized as SSPs. For example, the promoter of an SSCG5 paralogous gene, which encodes an allergen protein, was isolated and characterized as an SSP [19]. Together with the novel SSP AHSSP29 of SSCG29 (Aradu.YC8MH) identified in this study, which contained 2 RY REPEAT elements and 1 GCN4 element, these results indicate that the SSCG mining strategy used in this study appears to be effective and accurate. Once these promoters are isolated and characterized, they could be widely used for allergen reduction via gene editing technologies and for other research on seed quality improvement.
Geng et al. [48] introduced a method for tissue-specific promoter cloning by comparing expression levels among three tissues: leaves, roots, and seeds. A total of 316 seed-specific candidate transcript assembly contigs (TACs) were identified. In addition, 64.6% of select TACs were expressed exclusively in the seed and not in the leaves, stems, or roots [48]. However, to date, no SSPs have been identified based on these data, which may be attributed to insufficient transcriptome data and the lack of reference genome information. In our study, only two samples were chosen for transcriptome sequencing: seeds from different development stages and a mixture of nonseed tissue from six tissues (including roots, stems, leaves, flowers, pegs, and pod shells). It is much less expensive to sequence the transcriptome of nonseed tissue mixtures than to sequence each individual tissue. Moreover, it becomes simpler and more accurate to screen SSCGs by comparing two samples rather than by comparing numerous samples. Consequently, 337 SSCGs were identified, and 87% of the top 108 SSCGs were expressed exclusively in the seed and not in the five measured tissues (roots, stems, leaves, pegs, and pod shells). These results indicated that additional tissues were necessary as part of the nonseed sample to compare gene expression differences with seed samples. This SSCG information, such as the gene symbols, can be obtained conveniently from Table 1 and S2 Table. Researchers could easily download SSCGs of interest from the PeanutBase website according to this information. With the decreasing transcriptome sequencing cost and the release of the peanut genome, mining tissue-specific genes from peanut via comparative transcriptome sequencing has become a robust approach. For example, contamination with aflatoxin, which is produced in infected peanut seeds by Aspergillus flavus, is one of the major problems in peanut production. Given that peanut pericarps are barriers against A. flavus, pericarp-specific promoters are a good choice for expressing A. flavus-resistant genes specifically in the pericarp to prevent aflatoxin contamination. Pericarp-specific promoters could be identified by the strategy presented in this study.
Table 5. Putative cis-acting elements in the AHSSP29 promoter sequence.

Conclusions
We identified 337 SSCGs by comparative RNA sequencing (RNA-seq) between seed and nonseed tissues. The top 108 SSCGs, according to their FPKM, were characterized, among which 94 were expressed specifically in the seed, and 14 were preferentially expressed in the seed. In addition, a novel SSP, AHSSP29, was functionally characterized. The strategy presented in this study could facilitate the future exploration of tissue-specific promoters in other crop species. Additionally, the SSCGs identified in this work could be widely applied for SSP cloning by other researchers.