De Novo Regulatory Motif Discovery Identifies Significant Motifs in Promoters of Five Classes of Plant Dehydrin Genes

Plants accumulate dehydrins in response to osmotic stresses. Dehydrins are divided into five different classes, which are thought to be regulated in different manners. To better understand differences in transcriptional regulation of the five dehydrin classes, de novo motif discovery was performed on 350 dehydrin promoter sequences from a total of 51 plant genomes. Overrepresented motifs were identified in the promoters of five dehydrin classes. The Kn dehydrin promoters contain motifs linked with meristem specific expression, as well as motifs linked with cold/dehydration and abscisic acid response. KS dehydrin promoters contain a motif with a GATA core. SKn and YnSKn dehydrin promoters contain motifs that match elements connected with cold/dehydration, abscisic acid and light response. YnKn dehydrin promoters contain motifs that match abscisic acid and light response elements, but not cold/dehydration response elements. Conserved promoter motifs are present in the dehydrin classes and across different plant lineages, indicating that dehydrin gene regulation is likely also conserved.


Identification of dehydrin genes
A custom solution was used to identify all dehydrin genes found in the plant genomes described above. Amino acid sequences of several known dehydrins were obtained from NCBI GenBank [72], and the sequences of their K-segments were used to populate a seed FASTA file. A Python script was written that used the Biopython [73] Motif module to scan all amino acid sequences for proteins containing a sequence similar to the K-segment, based on its position frequency matrix (PFM). After each round of searching, new K-segment sequences were added to the original FASTA file. The Y-segment sequence file was constructed in a similar manner using the identified dehydrin protein sequences. Identified dehydrins were categorized based on the occurrence of conserved segments, using either their PFMs (K- and Y-segments) or a regular expression that described the simpler S-segment. All identified dehydrins were divided into five categories: Kn, KS, SKn, YnKn and YnSKn. Promoter sequences of 1000 bp upstream of the transcription start site (where data was available, otherwise upstream from the start site) were obtained from Phytozome BioMart or were extracted directly from the genomes using custom scripts. Oxytropis arctobia and Oxytropis splendens KS dehydrin gene sequences were obtained from NCBI GenBank (accessions: AEV59613 and AEV59617, respectively [6]). Promoter sequences of 1000 bp for O. arctobia and O. splendens were obtained by amplifying GenomeWalker libraries and sequencing the PCR products (Zolotarov et al., unpublished).
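The PFM-based K-segment search described above can be illustrated with a minimal sketch. It is written in plain Python rather than with the Biopython Motif module that the study used, and it is not the authors' script. The first seed is the commonly cited K-segment consensus (EKKGIMDKIKEKLPG); the two variant seeds, the uniform background, the pseudocount and the score threshold are all assumptions chosen for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pfm(instances):
    """Count residues per column for equal-length motif instances."""
    width = len(instances[0])
    if any(len(seq) != width for seq in instances):
        raise ValueError("all motif instances must have the same length")
    pfm = [{aa: 0 for aa in AMINO_ACIDS} for _ in range(width)]
    for seq in instances:
        for column, aa in zip(pfm, seq):
            column[aa] += 1
    return pfm

def log_odds(pfm, pseudocount=1.0):
    """Convert counts to log2 odds against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((n + pseudocount) / total) / background)
                     for aa, n in column.items()})
    return pssm

def scan(pssm, protein, threshold):
    """Return (start, window, score) for every window scoring at least threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(protein) - width + 1):
        window = protein[start:start + width]
        # Residues outside the 20 standard amino acids get the column minimum.
        score = sum(col.get(aa, min(col.values())) for col, aa in zip(pssm, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical seed file contents: consensus plus two made-up variants.
seeds = ["EKKGIMDKIKEKLPG", "EKKGIMEKIKEKLPG", "EKKGVMDKIKEKLPG"]
pssm = log_odds(build_pfm(seeds))
hits = scan(pssm, "MSSEKKGIMDKIKEKLPGHHH", threshold=20.0)
```

In an iterative search of the kind described, the windows returned by `scan` would be appended to the seed list before the matrix is rebuilt for the next round.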
To validate that the identified genes can be considered dehydrins, phmmer, as implemented on the HMMER web server [74], was used to search the UniProt Knowledgebase [75] for sequences with significant similarity to the putative dehydrins discovered using our custom method. The top ten significant hits were taken for each putative dehydrin and their domain annotations were extracted. Additionally, Needleman-Wunsch [76] pairwise alignment was used to compare 15 putative KS dehydrins to a known Arabidopsis KS dehydrin (AT1G54410). The closest match to every putative dehydrin in the NCBI GenBank non-redundant database was identified using BLAST [77].
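For readers unfamiliar with global alignment, the following is a small sketch of the Needleman-Wunsch algorithm with a percent-identity readout. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption made for brevity; the study does not state its scoring parameters, and a protein comparison would normally use a substitution matrix and report similarity rather than plain identity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]

def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of alignment length."""
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)

top, bottom, total = needleman_wunsch("ACGT", "AGT")
identity = percent_identity(top, bottom)
```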
All the scripts and sequence data used in this paper are available from https://github.com/zolotarov/dehydrin_promoters.

Intrinsic disorder and hydrophilicity analysis
The identified dehydrin sequences were compared with random plant protein sequences for intrinsic disorder and hydrophilicity to assess their classification as dehydrins. The grand average of hydropathicity (GRAVY) was calculated using the Biopython ProtParam module. To calculate the proportion of disorder, IUPred [78] scores were calculated for each amino acid, and the proportion of amino acids with a score above 0.5 (indicating disorder) was determined. Statistical comparison was performed using the t-test implemented in the scipy library [79,80]. For each species, the number of random protein sequences obtained equalled the number of dehydrins used in this study. The sequences were downloaded using NCBI Entrez Direct E-utilities [81].
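The two per-sequence quantities described above are simple to compute, as the following sketch shows. It is a plain-Python illustration, not the authors' code: GRAVY is taken as the mean Kyte-Doolittle hydropathy value over all residues (the quantity that Biopython's ProtParam module also reports), and the disorder proportion is the fraction of per-residue scores above the 0.5 cutoff. The example score list is made up; real scores would come from IUPred.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(protein):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in protein.upper()) / len(protein)

def disorder_fraction(scores, cutoff=0.5):
    """Fraction of residues whose disorder score is above the cutoff."""
    return sum(score > cutoff for score in scores) / len(scores)

# A negative GRAVY value indicates a hydrophilic sequence.
k_segment_gravy = gravy("EKKGIMDKIKEKLPG")

# Hypothetical per-residue disorder scores for a four-residue peptide.
example_fraction = disorder_fraction([0.9, 0.6, 0.5, 0.1])
```

The two groups of values (dehydrins and random proteins) could then be compared with a two-sample t-test, for example `scipy.stats.ttest_ind`.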

De novo motif discovery
Motifs were discovered using MEME v4.9.1 [82], Seeder v0.01 [83] and Weeder v1.4.2 [84], with the five sets of sequences used as separate inputs. Significant motifs were selected based on the following parameters: an E-value threshold of 0.05 for MEME, a Q-value threshold of 0.01 for Seeder, and the top 3 motifs recommended by the Weeder adviser. All promoters available through Phytozome BioMart from all species included in the analyses were used as a background set (a total of 1029220 promoters). A separate parser was written to extract significant PFMs from the result files produced by each program. The PFMs produced for each dehydrin class were entered into the STAMP [85] website to group matrices by similarity and to identify significant (E-value threshold of 0.05) matches in PLACE [23]. A representative member from each tree node of matrices grouped by similarity was selected and its sequence logo was generated using WebLogo 3.3 [86].

Results and Discussion
To further understand how different dehydrins are regulated in response to environmental stress, motifs corresponding to conserved cis-regulatory elements were detected in the upstream regions of dehydrin genes in all five subclasses. Dehydrin proteins are by nature unstructured, so a custom identification strategy was employed to retrieve as many dehydrin genes with up to 1000 bp of upstream region as possible. In total, 340 dehydrin promoters of 1000 bp and eight dehydrin promoters of shorter length were retrieved from 51 plant genome sequences. In addition, two promoters from dehydrin genes isolated from two Oxytropis species were included (Table 1, S1 Table). Of the queried genomes, 10 were from monocotyledonous plants, 37 from dicotyledonous plants, one from a basal angiosperm (Amborella trichopoda), two from gymnosperms (Picea abies and Pinus taeda) and one from a moss (Physcomitrella patens). The 350 sequences identified were confirmed to also match annotated dehydrins. For 330 out of 350 sequences, at least half of the top ten significant hits had "Dehydrin" as domain annotation. For the remaining 20 putative dehydrins, fewer than half of the top ten significant hits carried that annotation. Of those, three were annotated as either a dehydrin or similar to a dehydrin in NCBI GenBank and two were annotated as having a dehydrin domain in UniProtKB. The remaining sequences were all short putative KS dehydrins. In these cases, all significant phmmer hits were analyzed. Between 14.2% and 30.1% of significant hits had the "Dehydrin" domain architecture, for the sequences with the lowest and highest number of significant hits with the "Dehydrin" annotation, respectively. The rest of the significant hits had no architecture annotation.
When Needleman-Wunsch pairwise alignment was used to compare the 15 putative KS dehydrins to a known Arabidopsis KS dehydrin (AT1G54410), sequence similarities ranged from 59.0% to 98.0%. This evidence supports the notion that the sequences extracted for the analyses can be classified as dehydrins.

Biochemical properties of dehydrins
Dehydrins are known to be intrinsically disordered and hydrophilic [87], making it difficult, if not impossible, to identify them by overall sequence homology. These properties are, however, important for their hypothesized function in protein stabilization through interaction with water molecules, as well as for their subcellular location in the cytosol and the nucleus and not within membranes [87,88]. To assess the identification of the dehydrins included in this study, the grand average of hydropathicity (GRAVY) [89] and the proportion of amino acids in disordered regions were compared between dehydrins and random plant proteins. The 350 dehydrin amino acid sequences analyzed were significantly more hydrophilic than 350 random plant protein sequences (GRAVY -1.3470 for dehydrins, -0.2938 for random plant proteins, p-value < 0.001). In the dehydrins analyzed, the average proportion of amino acids in a disordered state was 99.32%, compared to 15.95% in random plant proteins (p-value < 0.001).

Promoters of KS dehydrins have one conserved GATA motif
In total, 47 KS dehydrin promoters were included in the de novo motif discovery (Table 1). Using the de novo motif discovery tool Seeder [83,90], a single putative conserved regulatory motif was discovered in all 47 promoter sequences (Motif 1, Table 2, S2 Table). A similar motif was also discovered with Weeder [84]. The KS dehydrins are known to be expressed in response to cold and dehydration, as well as being constitutively expressed [6,91-93]. Although the single identified overrepresented motif in KS dehydrin promoters does not directly match any typical cold or dehydration-related cis-regulatory elements in the PLACE database [23], it does match two motifs involved in light regulation and one involved in sugar regulation (Table 2): IBOXCORENT (I-box core) [94], REBETALGLHCB21 [95] and SREATMSD [96], respectively. These three experimentally validated motifs share four nucleotides (GATA).
Table 1 (totals row). Dehydrin promoters per class: Kn 39, KS 47, SKn 120, YnKn 21, YnSKn 123; total 350. * For the Oxytropis species, only one promoter was obtained per species, using genome walking.
One of these motifs, the I-box (GATAAGR), can form a light-responsive conserved DNA modular array (CMA) together with a G-box (CACGTGGC) when the two are located in close proximity. In transgenic Arabidopsis and tobacco (Nicotiana tabacum) plants, the presence of this CMA in a promoter drives GUS reporter gene expression when the plants are exposed to light. Interestingly, this expression seems to be mediated by phytochrome and cryptochrome photoreceptors [94].
Another of the motifs matching the motif discovered in the KS dehydrin promoters, REBETALGLHCB21, also called REβ (CGGATA), was first identified in gibbous duckweed (Lemna gibba) [95]. It is involved in phytochrome-mediated repression of promoter activity in darkness when located in close proximity to REα (AACCAA). Although REα was not identified as a significantly overrepresented motif, it is found in 26 out of the 47 KS dehydrin promoters analyzed. The GATA part of REβ was shown to be absolutely necessary for darkness-induced repression [95]. Furthermore, in Arabidopsis, C-repeat (CCGAC, CRT)-linked cold- and dehydration-induced gene expression is mediated by phytochrome [25]. While CRT was not found to be significantly overrepresented within the set of KS dehydrin promoters, it is noteworthy that 27 out of the 47 KS dehydrin promoters contain one or more copies of CRT or its reverse complement. Sixteen of the promoters contain both REα and REβ.
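Occurrence counts such as those above (promoters containing CRT or its reverse complement, or both REα and REβ) can be obtained with a short string search. The sketch below is an illustration under stated assumptions, not the authors' script: the function names are hypothetical, the toy promoter strings are made up, and every motif is searched on both strands.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def contains_motif(promoter, motif):
    """True if the motif occurs on either strand of the promoter."""
    promoter = promoter.upper()
    return motif in promoter or reverse_complement(motif) in promoter

def count_promoters(promoters, motifs):
    """Number of promoters that contain every listed motif (either strand)."""
    return sum(all(contains_motif(p, m) for m in motifs) for p in promoters)

# Made-up toy promoters, far shorter than the 1000 bp regions in the study.
toy_promoters = ["TTCCGACTT", "AAGTCGGAA", "AAAAAA", "AACCAATTCGGATA"]

with_crt = count_promoters(toy_promoters, ["CCGAC"])               # CRT core
with_both_re = count_promoters(toy_promoters, ["AACCAA", "CGGATA"])  # REalpha and REbeta
```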
The motif discovered in the KS dehydrin promoters also matched a sugar-repressive element, SREATMSD (TTATCC, SRE), shown to be involved in sugar-mediated gene repression in Arabidopsis [96]. Sugars are known osmoprotectants that are produced by plants in response to cold [97]. One of the suggested roles of dehydrins is the stabilization of protein conformation. Sugars such as sucrose and trehalose can replace water molecules on the surface of a protein and thus conserve its conformation. This allows cells to restore their function after rehydration [98].

Motifs discovered in promoters of Kn match abscisic acid and low temperature response elements
A total of 39 Kn dehydrin promoters were included in the de novo regulatory motif discovery analysis (Table 1). The Kn dehydrins are expressed in response to high salinity, abscisic acid (ABA), cold and dehydration [3,5,99,100]. A total of three putative regulatory motifs were identified in this set of promoters (Table 3, S2 Table); two were discovered using MEME [...] [102] and LTREATLTI78 (ACCGACA) [103], two low temperature response elements (LTREs) involved in cold response in A. thaliana. Additionally, Motif 3 matches an ABRE found in wheat and rice, ABREOSRAB21 (ACGTSSSC) [104]. The presence of both LTREs and an ABRE indicates that Kn dehydrins, similarly to SKn and YnSKn dehydrins, could be expressed in an ABA-dependent and an ABA-independent manner in response to osmotic stresses.
Table 3 footnotes: 1 Number of the motif and the de novo discovery software that was used to locate that motif. 2 Motif consensus sequence in IUPAC nucleotide code. 3 Occurrence is the number of promoters containing a de novo motif out of the total number of promoters analyzed for a specific dehydrin class, presented in parentheses. 4 Significance of the motif: E-value calculated by MEME, Q-value calculated by Seeder, presented in parentheses. 5 PLACE matches were identified using STAMP; only significant matches with E-value < 0.05 are presented. 6 E-value of the match with the PLACE motif.

SK n dehydrins contain multiple cold/dehydration, abscisic acid and light regulated response elements
A total of 120 SKn dehydrin promoters were analyzed (Table 1). Six de novo discovered putative regulatory motifs are presented in Table 4 and S2 Table. MEME and Seeder each discovered three motifs. The SKn dehydrins are known to be expressed in response to cold, ABA, dehydration and salt [3,5,14,105]. Three out of six motifs (motifs 5-7) have matches in PLACE that are known ABREs. Motif 5 (CCACGTGTC/GACACGTGG) matches ABREs from wheat (Triticum aestivum) [103] and canola (Brassica napus) [106]. Motif 6 (CCGACGCG/CGCGTCGG) matches ABREs from maize [107] and rice [108]. Motif 7 (CCAACGCG/CGCGTTGG) matches an ABRE from barley [109] and rice [107]. Motifs 6, 8 (CACCGACC/GGTCGGTG) and 9 (TGGTCGGT/ACCGACCA) match low temperature response elements known as C-repeats (CRT, consensus sequence: RCCGAC), found in numerous species [110-112].
In addition, motif 5 matches an element from tomato and Arabidopsis light-regulated genes [113,114] and motif 10 (GTGGGACC) matches an element from pea involved in light-induced repression [115].
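Consensus sequences written in IUPAC nucleotide code, such as the CRT consensus RCCGAC or ABREOSRAB21 (ACGTSSSC), can be matched against a sequence by expanding each ambiguity code into a character class. The sketch below shows one way to do this with regular expressions; the function names are hypothetical and the test sequences are illustrative. Only the given strand is searched, so the reverse complement would need to be scanned separately.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that also finds overlapping sites."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    # A lookahead keeps matches zero-width, so overlapping sites are all reported.
    return re.compile("(?=" + body + ")")

def find_sites(consensus, sequence):
    """Return 0-based start positions of the consensus in the sequence."""
    return [m.start() for m in iupac_to_regex(consensus).finditer(sequence.upper())]

crt_in_motif8 = find_sites("RCCGAC", "CACCGACC")       # motif 8 from the text
abre_example = find_sites("ACGTSSSC", "TTACGTGCCCAA")  # made-up sequence
```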
The presence of these significantly overrepresented motifs indicates that the SKn dehydrins are regulated at the transcriptional level and that their expression is modulated in response to cold and ABA. SKn dehydrins should also be expressed in response to drought, since CRT, which is also called the dehydration responsive element (DRE) [116], is found in their promoters. The circadian clock controls cold induction of C-repeat binding factors (CBFs), which in turn bind CRT/DRE elements [117]. Phytochrome and cryptochrome genes are also regulated by a circadian clock in Arabidopsis [118]. COR27, a cold-induced gene, is regulated by circadian clock related evening elements (EE) [119]. In addition to EE, the COR27 promoter also contains multiple ABREs and G-boxes, which motifs 5 and 6 also match. The core EE (AATATCT) [120] is found in 18 out of 73 SKn gene promoters analyzed. Motifs involved in light-induced regulation of gene expression found in the promoters of SKn genes could participate in modulation of these genes by the circadian clock. It has been shown previously, using bioinformatics methods, that the promoters of cold-regulated genes contain CRTs and ABREs [112,121], and our data also support those findings. Motifs 5 and 10 match an auxin response element found in the soybean GmAux28 [122] and SAUR15A [123] promoters, respectively.
It has been shown previously that numerous genes related to auxin response in Arabidopsis are modulated in response to cold, such as the auxin response factor ARF7 or the PINOID-binding protein 1, which is involved in hormone signaling and stress response [124].

YnSKn dehydrin promoters contain multiple ABREs, light REs and a CRT
YnSKn dehydrins represent the largest subclass out of the five dehydrin classes analyzed. A total of 123 YnSKn gene promoters were analyzed (Table 1). The YnSKn dehydrins are expressed in response to ABA, dehydration and high salinity [3,5,100]. The two motifs presented in Table 5 and S2 Table (Motifs 11 and 12) match numerous elements in the PLACE database, and both were discovered using Seeder. Motif 11 (GACACGTGGC) is very similar to Motif 5 (GACACGTGG), found in the SKn dehydrin promoters, and both match several of the same motifs in the PLACE database, namely ABREs, G-box and light response elements. Motif 12 (CACCGAC) is almost identical to Motif 8 (CACCGACC), discovered in the SKn dehydrin promoters, which matches the CRT/DRE necessary for CBF-mediated cold and dehydration response [116]. Overall, the motifs found in YnSKn dehydrin promoters are very similar to those found in SKn dehydrin promoters, indicating that they possibly have a similar function and that these two classes may have diverged more recently than the other classes. While the function of the Y-segment in the gene products of YnSKn and YnKn dehydrins is not known, it shows similarity to the nucleotide binding domain of plant chaperones [125]. The gene products of the other dehydrin classes do not have any such domains. In addition, there are evolutionary constraints on the Y-segment in a dehydrin from arctic Oxytropis species compared with temperate species [126], suggesting that the Y-segment might carry an important function that differentiates YnSKn from SKn dehydrins. Some of the published data show that YnSKn dehydrins are not expressed in response to cold [3,5]; however, there is evidence that after a period of acclimation they do accumulate in Red-Osier Dogwood (Cornus sericea L.) [127] and apple trees [128]. It is possible that cold-induced YnSKn dehydrin expression was not detected in some data sets due to a limited time of exposure to low temperature.

YnKn dehydrin promoters contain ABREs and light REs
YnKn dehydrins represent the smallest subgroup, with only 21 members found in 51 plant genomes (Table 1). YnKn dehydrins are known to be expressed in response to cold [129,130], and two motifs were detected in their promoters (Table 6, S2 Table). One was identified using Seeder and the other using MEME. Both motifs match several ABREs and light REs in the PLACE database. Motif 13 (TAACACGTGTC/GACACGTGTTA) is similar to motif 11 (GACACGTGGC/GCCACGTGTC) identified in YnSKn dehydrins, and it matches some of the same motifs in PLACE. Motif 14 (ACGTGGCA/TGCCACGT) is also similar to motif 11 (GACACGTGGC) found in YnSKn dehydrins. The lack of CRTs in the promoters of YnKn dehydrins suggests that they might be expressed in response to cold in an ABA-dependent manner that is not linked with the CBF transcription factors [18].

Conclusions
Numerous dehydrins were identified in 51 plant genomes, many of which are not found in protein databases such as InterPro or PROSITE, or are not annotated in Phytozome. The identified dehydrins were categorized into five subclasses based on the occurrence of conserved protein segments. Three de novo motif discovery software tools were used to find statistically significant overrepresented motifs in the promoters of each group of dehydrins. These motifs were matched to known cis-regulatory elements in the PLACE database to help explain the regulation of dehydrin expression in response to different environmental stimuli. Dehydrins are expressed in response to multiple stress stimuli. Although there is overlap in expression triggers between dehydrin subclasses, there are differences in the pattern of expression. Some of the dehydrins are expressed constitutively in all tissues [3,5] and more specifically in seeds [131,132]. The presence of ABREs, CRTs and light REs in the promoters of YnSKn and SKn dehydrins indicates that they could be expressed in response to dehydration and cold through both ABA-dependent and ABA-independent pathways, and that this expression is modulated by light.
While YnSKn and SKn dehydrins are found in most species, often in several copies, the other three subclasses are encountered less often. It is probable that they either have specialized functions or are expressed together with YnSKn and SKn dehydrins to increase the overall protective effect against dehydration. It is important to note that the number of discovered dehydrins is probably an underestimate due to incomplete genome assemblies and errors inherent in sequencing.
Dehydrins play an important role in the survival of plants facing various stresses. Motifs matching cis-regulatory elements linked to both ABA-dependent and independent stress response pathways, as well as light response pathways were detected in dehydrins from many different plant families. The implication of this finding is that the regulation of dehydrins is conserved in the plant lineages included in this study and that stress-linked selection pressure preserved cis-regulatory elements in the promoters of dehydrins through stabilizing selection.

Supporting Information
S1 Table. Annotation and meta-data about the dehydrins included in the study. Each identified dehydrin was further analyzed by BLAST to find the closest match in the NCBI GenBank non-redundant database. The fields are: 1. Species; 2. Gene; 3. Dehydrin subgroup; 4. BLAST top hit e-value; 5. BLAST top hit accession; 6. BLAST top hit description; 7. K-segment location; 8. Y-segment location; 9. S-segment location. (CSV)
S2 Table. Motif logos of motifs discovered in dehydrin promoters. Weblogos were made for each of the motifs identified by de novo motif discovery algorithms in five classes of dehydrin genes. The motif numbers correspond to the motif numbers in Tables 2-6. (DOCX)