Nucleotide patterns aiding in prediction of eukaryotic promoters

Computational analysis of promoters is hindered by the complexity of their architecture. In less studied genomes with complex organization, false positive promoter predictions are common. Accurate identification of transcription start sites and core promoter regions remains an unsolved problem. In this paper, we present a comprehensive analysis of genomic features associated with promoters and show that probabilistic integrative algorithms-driven models allow accurate classification of DNA sequence into “promoters” and “non-promoters” even in absence of the full-length cDNA sequences. These models may be built upon the maps of the distributions of sequence polymorphisms, RNA sequencing reads on genomic DNA, methylated nucleotides, transcription factor binding sites, as well as relative frequencies of nucleotides and their combinations. Positional clustering of binding sites shows that the cells of Oryza sativa utilize three distinct classes of transcription factors: those that bind preferentially to the [-500,0] region (188 “promoter-specific” transcription factors), those that bind preferentially to the [0,500] region (282 “5′ UTR-specific” TFs), and 207 of the “promiscuous” transcription factors with little or no location preference with respect to TSS. For the most informative motifs, their positional preferences are conserved between dicots and monocots.


Introduction
Core promoters are the 5' regions adjacent to the transcriptional start site (TSS) and containing binding sites for transcription factors (TFBS). Computational analysis of the eukaryotic promoters is hindered by their complex architecture [1][2][3]. Each gene contains one or more TSS, and, respectively, one or more promoters, which initiate transcription of a gene. Depending on species, from 30% to 60% of eukaryotic genes contain the TATA motif approximately genetic variants across the Oryza sativa cultivars [28]. Observed clusters of reduced nucleotide variability were shown to highlight functionally important genomic regions. Interestingly, a sharp decline in SNP density was noted about 250 nucleotides upstream of TSS elements; this decline reaches its minimum exactly at the TSS.
In plant genes with multiple promoters, precise mapping of TSSs requires incorporation of diverse data types including tissue/stress specificity of each transcript. Unfortunately, most of currently available techniques cannot incorporate a variety of available data, and also must ignore alternative promoters. Therefore, accurate identification of TSS and core promoter regions remains an open problem. Since evidences for location of TSS are imprecise, the best approach for promoter production should embed probabilistic integrative algorithms. In this paper, we present a comprehensive analysis of genomic features associated with the promoters and show that probabilistic integrative algorithms-driven models allow accurate classification of DNA sequence into "promoters" and "non-promoters" even in absence of full-length cDNA sequences. These models may be built upon the maps of the distributions of SNPs, RNA sequencing reads on genomic DNA, methylated nucleotides, TFBS as well as relative frequencies of nucleotides and their combinations.

Selection of the "gold standard" gene prediction models
To aid a selection of the best available rice genome annotation, Fgenesh and MSU mRNAbased gene prediction models were compared. Fgenesh gene prediction set contains 18,389 high quality (5 0 full, with mRNA support) gene models, while the MSU gene prediction set contains 20,367 high quality gene models [28,29]. For every gene in both models, we extracted a 1,000 nt long sequence centered at the TSS, and calculated distributions of genomic features previously associated with the start of transcription: (1) frequency of dinucleotide CA [1,30,31]; (2) frequency of TATA [1,4,32]; (3) nucleotide consensus around TSS [12,13,33]; (4) CG skew (CG skew ¼ #CÀ #G #Cþ#G where #C and #G refers to the counts of nucleotides C and G in a certain genomic window) [34]. Fig 2C and 2D shows that Fgenesh-annotated promoters have a more pronounced nucleotide consensus as compared to the promoters annotated by MSU. Fgenesh promoters also have higher frequency of the exact TATA motif at -30 (B), and more CA dinucleotides at the position of TSS (A) . Fig 2(F) shows peak of the CG skew at TSS, calculated in the window of 40 nt both annotations; Fgenesh-annotated CG skew peak is higher than the MSU one. Based on the assumption that these features reliable reflect the quality of promoter annotation, for further analysis the Fgenesh model was selected.

Distribution of transcription factor binding sites
The distributions of the transcription factor binding sites (TFBS) in promoters and the UTRs of high-confidence rice genes in the regions of -1000 +1000 around TSS were investigated with MATCH algorithm [35] incorporated in geneXplain platform (www.genexplain.com). MATCH uses the TRANSFAC database [36] comprising 764 plant position weight matrices (PWM) with a strict similarity score threshold of 0.95. MATCH scans the targets promoter sequences with a sliding window equal to the length of the PWM and calculates a score for each of the windows. The maximum value of the score (1.0) corresponds to the sequence that fully fits the consensus of the given PWM. Score threshold of 0.95 allows very little mismatches to the consensus, with few permitted mismatches limited to less conserved positions. In addition, the MATCH score considers the nucleotide position-specific entropy measures. In a recent study, MATCH performed with accuracy superior to other motif-finding algorithms [37]. In the Fgenesh-predicted rice promoters, MATCH search against the TRASNFAC database  resulted in mapping of 3.2 million potential TFBS corresponding to 667 plant PWMs, while 97 PWMs remained matchless, possibly due to their exclusive role in the dicots or to the binding to distal promoters not analyzed in the present study (see S2 Table). Interestingly, 487 out of 667 TFBS (73%) were found in proximal promoters of Oryza sativa more than 1000 times; the most frequent sites were that for the transcription factors ASR1, DOF56 and PBF. When the frequencies of TFBS found in the proximal promoters were compared with the frequencies for the same PWMs found in randomly shuffled sequences, the most significant promoter-specific TFBS enrichments (twice or more) were observed for SPL12, SPL5, GBF1, ABI5, BZIP68, LEC2, and GT1 transcription factors.
To account for dinucleotide statistics matching that of the Fgenesh rice promoter regions, another set of randomly shuffled sequences was generated as described by Stepanova, Tiazhelova [38]. Briefly, the 2000 nt regions [TSS-1000, TSS+1000] were divided onto non-overlapping 100 nt windows, then the dinucleotide statistics were calculated for each window. For each promoter, a 2000 nt long sequence with matching dinucleotide composition was generated and subjected to MATCH prediction of TFBS [35]. After selecting the motifs that occur at least in 100 different rice promoters, Kolmogorov-Smirnov test was applied to find significantly over-represented sequence motifs (S3 Table). Fig 3 shows examples of TFBS that occur at frequencies that differ and do not differ significantly between real and simulated sequences. The most pronounced differences (p-value < 0.002) were detected for the distributions of binding sites for TCP15, LIM1, HBP1A, and TCP23. On the other hand, occurrences of binding sites for CMTA2, GATA1, SBF1, and WRKY48 in real and simulated sequences were not different (p-value> 0.99999).

Positional specificity of TFBS distribution
A phenomenon of the positional preference in TF binding was previously described by Weirauch, Yang [39], who showed that positions of TFBS are not randomly distributed in respect to the start of transcription (TSS); this observation holds across evolutionary kingdoms. To illustrate this phenomenon in rice, we divided the [TSS-1000, TSS+1000] regions into 100 nt long bins and calculated frequency histograms of TFBS occurrence in each bin; then we used K-means algorithm to cluster these histograms, with value at each bin treated as a separate dimension.
Positional clustering of TFBS demonstrates that the cells of Oryza sativa utilize three distinct classes of transcription factors: Class 1, which binds preferentially to the [-500,0] region ("promoter-specific", N = 188); Class 2, which binds preferentially to the [0,500] region ("5 0 UTR-specific", N = 282); and Class 3, which includes predominantly "promiscuous" transcription factors with weak or no location preference for respective TSS (N = 207), see S4 Table and  To conduct the comparative gene ontology analysis of Class 1, 2 and 3 transcription factors (Table 1), the chi-square "Goodness of Fit" tests were used: and E i correspond to observed and expected numbers of genes in i th category. Class 1 TFs of the rice are enriched in the following GO terms: "sequence-specific DNA binding", "protein dimerization activity", "systemic acquired resistance, salicylic acid mediated signaling pathway", "regulation of transcription from RNA polymerase II promoter", "response to bacterium", "jasmonic acid mediated signaling pathway", "carpel development", "protein binding", "negative regulation of defense response", "protein targeting to membrane", "regulation of plant-type hypersensitive response", "plant ovule development", "response to ozone". Class 2 TFs are enriched in GO terms "DNA binding", "ethylene-activated signaling pathway", "response to water deprivation". Class 3 TFs are enriched in "cellular response to nitrogen levels".
To compare expression specificity of Class 1 and Class 2 transcription factors, we used the difference of proportions test ( Table 2) Genes encoding the Class 1 TFs are predominantly expressed in the petals, the sepals and the embryos of plants, while mRNAs encoding Class 2 TFs are overrepresented in the roots. This may explain previous observations of significant association of TATA motifs with expression in plant roots    [4,39,41,42]. RAP2.6 is a defense-related, ethylene response transcription factor which recognizes the GCCbox and characterized by high affinity to DNA sequence GCGCCGCCG [43]. Ali, Abbas [44] experimentally showed that RAP2.6 works both in tissue-specific and stress-specific manner. Under normal conditions, expression of RAP26 is elevated in roots and stems, while being significantly reduced when plant is infected with pathogenic nematodes, such as H. schachtii. To suppress resistance responses, nematodes downregulate expression of RAP2.6 in host cells. MYB111 is involved in the regulation of several genes of the flavonoid biosynthesis pathway in cotyledons and leaves [45,46]; it confers tolerance to UV-B [47]. Its binding site MYB111_02 has consensus G[G/T]TAGGT[A/G] [43]. MYB111 is an example of TFs with relatively weak position specificity related to TSS. The TFBS motifs occur no very often; they usually provide condition-specific regulation of genes. According to the Kolmogorov-Smirnov test, three classes of TFs differ in the significance of over-representation of their TFBS in promoters and in the randomly shuffled sequences: thirty-seven percent of the Class 1 TFs with motifs located predominantly upstream of TSS were significantly overrepresented (p-values <0.05), In the Class 2 TFs with TFBS located in 5 0 UTRs, overrepresentation was confirmed for 20% of the PWMs. The TFBS for Class 3 TFs were distributed evenly. For the latter group, significant over-representation was detected for 15% of class members (S2 Table).
In summary, three classes of TFBS differ in their position specificity, percentages of PWMS significantly over-represented in real promoters, functional classification of their downstream genes, and the patterns of their gene expression.

Evolutionary conservation of TFBS position information content
We have analyzed evolutionary conservation of the TFBS position information content (a measure of unevenness of the motif distribution along promoter regions, see Method section) in monocots Oryza sativa and Zea mays, and in the dicot Arabidopsis thaliana (Fig 7). Following correlations between these measures were identified: Correlation of the TFBS position information content in two monocots (rice and corn) were higher than that for the rice and a dicot plant Arabidopsis. By extracting TFBS with more than 10,000 matches in each of three plant genomes, a list of 46 "common" informative TFBS was compiled. Each of these TFBS was classified into either "promoter-specific" or "5 0 UTRspecific" category in each species. Between rice and corn, 42 of 46 "common" TFBS are consistent in their position preference (see S7 Table). Between rice and arabidopsis, the agreement is seemingly higher, with 45 of 46 TFBS of the common set having the same positional preference (see Supplemental Data). We hypothesize that this discrepancy is due to lower reliability of the TSS map in corn genome as compared to arabidopsis and rice genomes (Fig 8). Importantly, this phenomenon may lead to a systematic "shifting" of the TFBS peaks from promoters to 5 0 UTRs and vice versa. Fortunately, incorrect prediction of TSS in corn does not affect the information content of a TFBS, and correlation coefficient of motif information content between two grasses (rice and corn) is 0.87, which is above the correlation between rice and arabidopsis (Fig 7). In summary, positional preference of the most informative motifs remains conserved between dicots and monocots. Nucleotide patterns aiding in prediction of eukaryotic promoters

Identification of similar TFBS
Since TRANSFAC database tends to accumulate all published motifs, some of collected motifs appear to be redundant. For example, several PWMs may be independently built and reported Nucleotide patterns aiding in prediction of eukaryotic promoters for the same transcription factor (Fig 9). Also, transcription factors of the same protein family may recognize highly similar motifs, which will be reflected by similarities of respective PWMs. Although regulatory functions of these may vary, for a practical use in promoter prediction, these motifs should be clustered into a non-redundant set based on similarity of their PWMs. By clustering 764 plant PWMs, a non-redundant set of 376 sequence motifs was obtained, among which, forty-six were found informative with the scores above 0.0138 (see S1 Table).

Nucleotide variants resulting in the TFBS loss and gain
Core promoters and 5 0 UTR regions located within 200 bp around the TSS are both protected against accumulation of nucleotide variants (Fig 10). This protective effect is due to selection constraints, which prevents disruption on regulatory elements located near TSS by neutral or near-neutral genomic variants. Cross-analysis of comprehensive collection of plant TFBS [37] and an extensive dataset of the genomic variants detected in various rice cultivars [27] allowed us to classify regulatory elements of these plants according to their tolerance to the mutations (see S5 Table).
To achieve that, we considered distribution of SNPs and their effects on loss and gain of TFBS. For each nucleotide change, we have calculated Δ = |q − q Ã | for the TFBS scores before (q) and after (q Ã ) nucleotide change, and compared its values to empirically determined thresholds. Calculations of the scores q and q Ã were done according to the MATCH scoring formula (see Materials and Methods). If Δ ! Δ 0 , the site was considered as "lost" or "gained" depending which score value was larger, q or q Ã .
Frequencies of site losses and site gains for the promoters and for the random subset of 18,389 intergenic sequences, each 2,000 nt in length, were compared. We hypothesized that functionally important promoter motifs will have less variation causing the loss of sites. For each TF, we calculated the ratio of the site losses in intergenic sequences to the site losses in promoters. All entries were than ranked according to these ratios, which reflected relative "suppression" of the site losses by SNPs (Table 3). Relative suppressions of the site gains were Nucleotide patterns aiding in prediction of eukaryotic promoters calculated in a similar fashion (see Table 4). The binding sites for ABF (CACGTGGC) and CBF4 transcription factors were the most protected from the site loss. In abscisic acid signaling, ABF factors govern osmotic stress response through modulation of the gene expression downstream of SnRK2 kinases, while CBF4 regulates adaptation to drought. For several important transcription factors, such as MADS8 (involved in the control of flowering time), GT-1 and GATA-1 (response to light), we observed that variation was avoided in positions where nucleotide change can lead to the site gain. Additional data and the results of the analysis of SNPs in TFBS could be seen in the Table 3, Table 4, and S5 Table. The binding sites for AT2G20350 and ARF1 transcription factors were the most "protected" from the site loss. AT2G20350 factors regulate activity of ethylene-activated signaling pathway. The plant hormone ethylene is involved in many aspects of the plant life cycle, including seed germination, root hair development, root nodulation, flower senescence, abscission, and fruit ripening (Johnson and Ecker, 1998). ARF1 is a member of the auxin response factor family, involved in hyperosmotic salinity response. For several important transcription factors, such as WRKY23 (involved in hyperosmotic salinity response and response to auxin), FUS3 (plays a role in embryonic development ending in seed dormancy and response to auxin stimulus), we observed that variation was avoided in positions where nucleotide change can lead to the site gain.

Distribution of RNA-Seq reads
Predictably, an analysis of mapped RNA-Seq reads near TSS [-1000; +1000] showed that, on average, coverage peaks are observed immediately downstream of TSS (Fig 11). However, some genes lack a peak of RNS-Seq reads at their TSS. Notably, only 26% of rice genes display a maximum of the coverage in the range [-50, +250], and only 60% of genes display this maximum in the range [-50, +550].

R-loop forming sequences (RLFS)
Three-stranded nucleic acid R-loop structure is formed between nascent RNA transcript and DNA template [48]. Length of the R-loop sequence varies between 150 to 650 nt. R-loops aid in the prevention of methylation within promoters [49][50][51] and are associated with initiation of transcription and other important gene-level features [48]. In particular, R-loops accumulate at the G-rich 5 0 -UTR regions immediately downstream of the CpG-non-methylated human promoters [50]. To map the R-loop forming structures in the area [TSS-1000, TSS+-1000], we used the QmRRFS tool [48,52,53]. QmRRFS partitions R-loops into three segments, the RIZ (DNA region of initiation of R-loops containing at least three contiguous guanines), "Frequency intergenic"/"Frequency promoters" is the ratio between frequencies of site loss due to SNPs located in intergenic regions and the site loss due to SNPs located in promoters.
https://doi.org/10.1371/journal.pone.0187243.t003 Table 4. Suppression of site gain caused by nucleotide variation in promoters. Column "Frequency intergenic"/"Frequency promoters" contains the ratio of the frequency of site gain due to SNPs located in intergenic regions to the frequency of the site gain due to SNPs located in promoters.  Table). Notable, this peak coincides with the region where polymerase typically pauses after the initiation of transcription [48,52,53].

DNA methylation
In the intergenic regions and within functional classes of genes and their promoters, the patterns of DNA methylation predictably differ [54,55]. The most pronounced effect was observed for the methylated CpGs (see Fig 12). Intergenic level of CpG methylation was at 0.27, with sharp decline starting around 600 bp upstream of TSS to about 50% of that in intergenic region level at the position of -170, then proceeds to its minimum (0.01) at 8 bp upstream from the TSS.

Combining the characteristic features of TSS into promoter classifier
We used 18,389 "promoter" (positives) and 18,389 "non-promoter" (negatives) sequences. To train the model, we used 14,711 positives and negatives; and the remaining 3,678 positives and negatives were used for testing. The binary classifier interrogates the candidate sequence and reports whether the sequence is "promoter" or "non-promoter". The best combination of features was: composition of DNA sequence, GC-skew value and presence/absence of the CAmotif in every position. It achieved the best accuracy (0.9995) and has the Matthews correlation coefficient of 0.9989 (see Table 5). Other features also improve the classification accuracy in comparison with the DNA sequence alone, however, not performing as well as the combination of DNA sequence, GC-skew and CA-motif distribution.

Discussion
In this work, we have investigated several features of promoter area, identified characteristic patterns of their distribution and assessed utility of these features for identification of TSS location. Accuracy of TSS identification affects the overall quality of regulatory region analysis. To date, large amounts of the "mapped" TSS are, in fact, defined only approximately. A significant fraction of promoters has multiple alternative TSS, many of which are not yet annotated. These features make prediction of exact positions of TSS a very complex problem. Further work toward exact mapping of all TSS positions using various promoter characteristics in multiple species is warranted. It is essential to find and annotate tissue-and condition-specific transcription start sites and associate them with alternative splice form, gene regulatory network, and protein function. Intelligent integration of multiple types of genomic information (DNA composition, regulatory elements, DNA methylation, RNA-Seq coverage data, SNP distribution etc.) may improve annotation of tissue-and developmental stage-specific genes that are often misidentified due to their atypical sequence composition in grasses [54][55][56]. We showed that the region containing promoter-UTR boundaries could be defined using the following pronounced trends: (1) drop in SNP density, (2) evolutionary conserved peaks and valleys of the positions of regulatory elements, (3) peak of RNA-Seq coverage immediately downstream from the TSS, (4) peak of CG skew, (5) drop in DNA methylation density in CpG, CHH and CHG contexts, where H denotes A, C or T nucleotide. Integration of multiple noisy features of promoter regions can result in 99% classification accuracy. Features identified as important by deep learning based classification can now be used to build a scoring function for promoter prediction.
In our work, we focused on the 2000 nucleotide long region around rice TSS, supported by experimental evidence. In rice, the median length of 5 0 UTR is 120 nt; with less than 1.2% of 5 0 UTRs being larger than 1000 nt [28]. Therefore, for the vast majority of loci, the considered regions covered both transcription and translation start sites, being sufficient for description and classification of rice promoters.
Analysis of SNPs in the context of TFBS in promoter and non-promoter region indicated that TFBS differ by their tolerance to nucleotide variation. It is of note that the binding sites for AT2G20350 and ARF1 transcription factors were the most "protected" from the site loss. Both of these factors are involved in plant hormone signaling [57]. We conclude that sites for these transcription factors are "protected" in evolution from being lost due to their importance for regulation of plant lifecycle. It was interesting to observe that for several transcription factors nucleotide variations were avoided in positions where nucleotide change can lead to the site gain. Among such factors were WRKY23 and FUS3, involved in gene regulation in response to the plant hormone auxin. We propose that spurious generation of novel sites for these transcription factors may significantly alter cellular timing. We conclude that TRANS-FAC analysis may results in functional observations as it provided clear evidence of interplay between SNPs and TF binding sites in rice genome.

Fgenesh++ rice gene prediction
Fgenesh++ (Find genes using Hidden Markov Models) [58][59][60] is a HMM-based ab initio gene prediction program [61]. We used the rice chromosomes (version MSU 7, [29]) to make the initial gene prediction set, applying the Fgenesh gene finder with generic parameters for monocot plants. From this set, we selected a subset of predicted genes that encode highly homologous proteins (using BLAST with E-value cut-off 1.0E-10) to known plant proteins from the NCBI non-redundant (NR) database. Based on this subset, we computed gene-finding parameters, optimized for the rice genome, and executed the Fgenesh++ pipeline to annotate the genes in the genomic scaffolds. The Fgenesh++ pipeline used all available supporting Nucleotide patterns aiding in prediction of eukaryotic promoters data, such as known transcripts and homologous protein sequences. NR plant and, specifically, rice transcripts were mapped to the rice genomic sequences, therefore identifying a set of potential splice sites. Plant proteins were mapped to the rice genomic contigs, and the high scoring matches were selected to generate protein-supported gene predictions, so that only the highly homologous proteins were used in gene identification. Amino acid sequences from predicted rice genes were then compared to the protein sequences from plant NR database using the 'bl2seq' routine, and the similarity was significant if it had a BLAST percent identity ! 50, BLAST score ! 100, coverage of predicted protein ! 80% and coverage of homologous protein ! 80%. BLAST analysis of the predicted sequences was also carried out against the O. sativa mRNA dataset, using an identify cutoff of >90%. Predictions that have both NR plant RefSeq and O. sativa mRNA support, as well as the 5 0 UTR longer than 20 nucleotides and shorter than 1000 were selected for the analysis.
GFF file with Fgenesh++ gene prediction is available as a Supplemental Data file.

MSU rice gene models
The current MSUv7 annotation (http://rice.plantbiology.msu.edu) of rice genome contains 55,986 predicted genes and 66,338 gene models [29]. Upon exclusion of pseudogenes, transposable elements, and genes with atypical lengths of 5 0 UTR (below 20 nt or above 1000 nt long), a high-confidence set contains 20,367 expressed protein-coding rice genes.

Arabidopsis gene and promoter models
Genome annotation files for TAIR 10 version and sequences for 3000 nucleotides upstream from ATG were obtained from The Arabidopsis Information Resource (TAIR) [24,62]. The upstream sequences were truncated based on the position of the nearest upstream locus. 290,085 EST sequences were obtained from NCBI and TAIR and mapped onto the 27,199 upstream sequences using nucleotide BLAST + (minimum identity percent: 95%; maximum query start of alignment: 5; only plus strand alignments were used). Using the text search, we removed ESTs annotated as 3 0 or partial. NPEST [5] algorithm was used and resulted in prediction of 17,452 transcription start sites for 16,520 protein-coding loci.

Corn gene and promoter models
Genome annotation of maize (B73, 6a) contains 40,602 predicted protein-coding genes [63]. We excluded genes with atypical lengths of 5 0 UTR (below 20 nt or above 1000 nt long), genes without full-length mRNA support, without valid start and stop codon, or no PFAM annotation. This filtering resulted in 16,180 putative corn TSS.

Positional information content of transcription factor binding sites
We selected TFBS that occur at least 10,000 times in promoters of a given species. In rice, it amounted to 487, in Arabidopsis -559, and in corn-171 TFBS. To calculate the information content of each TFBS, we divided the region around the start of transcription (TSS-1000, TSS +1000) into 100 nt long bins, and calculated the observed frequency of TFBS matches in every window as a ratio of matches within the window to the total number of matches f o ¼ m T . The expected frequency is calculated as f e ¼
The datasets were processed using the following protocol: 1. Duplicates were removed using tool clumpify (http://jgi.doe.gov/data-and-tools/bbtools/bbtools-user-guide/clumpify-guide/) allowing for up to two errors per read.
Summary statistics is shown in the Table 6.

Identification of transcription factor binding sites
The prediction of TF binding sites is done using the MATCH tool, which is based on the usage of information vector-based PWM model. This model calculates the matrix similarity score (q) defined in [35]. This model is a common additive model, which uses a transformed matrix instead of an initial matrix, where each column of the transformed matrix is determined with the help of weighting the corresponding initial column by information content. The matrix similarity score q is calculated according to the following formula: We have recently demonstrated that this strategy is superior to the common alternative approaches of computing the TFBS scores [37]. Site loss and gain. We analyzed distribution of SNPs and their effect on TF binding site loss and gain. The effect of a SNP on TF binding sites was computed as the follows. For each SNP and for each PWM model we computed two matrix similarity scores (see above): q and q Ã corresponding to two nucleotides in the SNP-the reference and alternative nucleotides. Next, we calculated Δ = |q − q Ã |, and compared its value to the empirically determined threshold Δ 0 . If Δ ! Δ 0 , the site was considered as "lost" or "gained" depending on sign of the difference q − q Ã .
We then calculated frequencies of site loss and site gain for all considered SNPs to identify which transcription factor binding sites (TFBS) are significantly enriched by the effect of nucleotide changes in SNPs analyzed. As a background, we considered random nucleotide changes in random genomic positions. We denote study and background sets briefly as "Yes" and "No" sets (the "Yes" set is the set of TFBS sequences overlapping SNPs with either the reference nucleotide or alternative nucleotide; the "No" set is the set created by random nucleotide substitutions in random genomic positions). The algorithm for TFBS enrichment analysis, called F-Match, has been described in Kel, Konovalova [66] and Koschmann, Bhar [67]. Briefly, the procedure finds a critical value (a threshold) for the differences between scores q and q Ã (the threshold Δ 0 ) of each PWM in the library that maximizes the "Yes/No" ratio R YN as defined in Eq (1) under the constraint of statistical significance: In Eq (1), #Sites and #Seq are the sites and sequences counted in "Yes" and "No" sets. A high "Yes/No" ratio indicates strong enrichment of binding sites for a given PWM in the "Yes" sequences. The statistical significance is computed as follows: The Yes/No ratio and P-value is computed separately for the site gain and for the site loss. If "Yes/No" ratio >1 and a P-value < 0.01 for a given PWM we consider this as an indication of an enrichment of SNPs by the sites for the given PWM. We can say that sites of this PWM are frequently effected by the SNPs and, therefore, the gene regulation by the respective TFs is significantly altered by the considered SNPs.
Matrix clustering. Many matrices in the TRANSFAC database are highly similar, up to the point being undistinguishable. To lower the complexity of the training data, we performed hierarchical clustering and used only one matrix from each cluster for promoter classification. The distance between two motifs is calculated as sum of squared differences between all matrix elements. If matrices were not the same size, we slide the shorter matrix over the longer one and take minimal distance. The cut-off for merging clusters was determined empirically by considering the sequence logos of matrices to be merged at each step and deciding which matrices we consider duplicates.

Classification of promoter regions
There are many network architectures and the task is to choose a suitable one for a given research problem. We used Convolutional Neural Networks (CNN) architecture for building promoter recognition models developed by Umarov and Solovyev [10]. The software consists of several modules. In the learnCNN.py modules the CNN model was implemented using Kerasa minimalist, highly modular neural networks library, written in Python. It uses the Theano library as a backend and utilizes GPU for fast neural network training. Adam optimizer was used for training with categorical cross-entropy as a loss function. Our CNN architecture ( Fig  13) consists of one convolutional layer with 200 filters of length 21. After the convolutional layer, there is a standard Max-Pooling layer. The output from the Max-Pooling layer is fed into a standard fully connected ReLU layer with 128 neurons. Pooling size was equal to 2. The ReLU layer is connected to the output layer with sigmoid activation, where neurons correspond to promoter and non-promoter classes. The batch size used for training was 16.
Input of the network consisted of nucleotide sequences where each nucleotide is encoded by a four-dimensional vector A (1,0,0,0), T (0,1,0,0), G (0,0,1,0) and C (0,0,0,1) and other dimensions filled by other promoter features such as: GC-skew, DNA methylation, SNP, presence of CA motif, presence of TATA motifs, TFBS. The output is a two-dimensional vector: "promoter" (1, 0) and "non-promoter" (0, 1) prediction. learnCNN.py learns parameters of the CNN model and outputs the accuracy of promoter prediction for the test set of sequences. It also writes the computed CNN Model into a file, which can be used later in programs for promoter identification in each sequence. We used 70% of these examples for learning, 10% for validation (to find an optimal number of learning epochs) and 20% for testing.
Quality of prediction was assessed using the following measures: True Positives (TP), True Negatives (TN), False Positive (FP), False Negative (FN), Accuracy, Sensitivity, Specificity, Matthews correlation coefficient (CC): Supporting information S1 Data. Annotation of Oryza sativa genome using the Fgenesh++ pipeline.

Authors contribution
Conceived and designed the study: TT and AK. Developed algorithms: VS, AK, TT. Analyzed the data: MT, VS, TT, AK. Contributed analysis tools: VS, AK. Wrote the paper: TT, AB, MT, AK. Participated in the editing of the manuscript: MT, VS, AB, AK, TT. All authors read and approved the manuscript.