Comprehensive Human Transcription Factor Binding Site Map for Combinatory Binding Motifs Discovery

Knowing the map between transcription factors (TFs) and their binding sites is essential to reverse engineering the regulation process. Only about 10%–20% of the transcription factor binding motifs (TFBMs) have been reported; this lack of data hinders our understanding of gene regulation. To address this gap, we propose a computational method that exploits previously unused TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory "DNA words." From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns, creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed the predictions down to a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs, confirming that our system can reliably predict ab initio motifs with an accuracy of 81%, far higher than previous approaches. We found that, on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that supplementary regulatory mechanisms are required to control smaller gene sets independently. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the available targets in the genome, TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of "DNA words," the newly predicted motifs, and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter for discovering TFBMs that play a major role in orchestrating other factors and thus are likely to lock/unlock cellular functional clusters.


S1.2 Similarity search
Similarity search, also known as nearest neighbor search, proximity search, or the post office problem (4,5,6,7), is defined as the search for the closest points according to a similarity criterion or a distance function. Let M = (D, d) be a metric space for a domain of objects D and a total distance function d : D × D → R. Given a collection S ⊆ D and a query object Q, we define three types of similarity queries:
• k-nearest neighbor (k-NN) query: returns the k closest elements to Q. Formally, the set R ⊆ S such that |R| = k and for any X ∈ R and any Y ∈ S − R: d(Q, X) ≤ d(Q, Y).
• Range query: given Q ∈ D and a range r ∈ R, retrieves all the objects within distance r of Q. Formally, the set {X ∈ S | d(X, Q) ≤ r}.
• Range k-NN query: the intersection of the previous two types of queries.
With these three types of queries it is possible to obtain a set or cluster of objects similar to an input query object. The k-nearest neighbor search is a computationally expensive operation, so an indexing data structure is required to reduce its cost. We employed a sketch-based index (8) implemented in the open-source engine OBSearch (9). The method transforms each object into a compact bit-string representation that is easier to search. Since the index is compressed and can also solve multiple queries simultaneously, the algorithm runs considerably faster than competing methods (8). When we use similarity search in this text, we either "search for the closest" items (range query) or perform a "nearest neighbor" search (k-NN query).
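OBSearch's index is not reproduced here, but the general idea — compress each object into a bit string, pre-filter candidates by Hamming distance on the sketches, then re-rank the survivors with the exact distance — can be sketched as follows. The random-hyperplane sketch and all sizes are illustrative assumptions, not the published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sketch(vectors, planes):
    # Each object becomes a compact bit string: the sign pattern of its
    # projections onto a fixed set of random hyperplanes.
    return (vectors @ planes.T) > 0

def knn(query, vectors, planes, k=3, shortlist=10):
    db_sketches = sketch(vectors, planes)
    q_sketch = sketch(query[None, :], planes)[0]
    # Cheap pre-filter: Hamming distance between bit sketches.
    ham = (db_sketches != q_sketch).sum(axis=1)
    candidates = np.argsort(ham)[:shortlist]
    # Exact L2 distance only on the shortlisted candidates.
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

data = np.asarray(rng.normal(size=(100, 16)))   # 100 objects, 16 dimensions
planes = rng.normal(size=(32, 16))              # 32-bit sketches
neighbors = knn(data[0], data, planes, k=3)
```

Because the sketch comparison is a cheap bitwise operation, the expensive exact distance is computed only for the small shortlist, which is where the speed-up reported for the sketch index comes from.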

S1.3 TFBM similarity criterion
The similarity of two TF binding motifs (TFBMs) is calculated by applying the Pearson correlation (10,11) to the position weight matrices (PWMs) (12) of the motifs. Each matrix is transformed into a vector X, giving each element a position in the feature set. Thus, for two vectorized PWMs X = (x_1, …, x_n) and Y = (y_1, …, y_n), the Pearson correlation is defined as:

ρ(X, Y) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1}^{n} (x_i − x̄)² · Σ_{i=1}^{n} (y_i − ȳ)² )    (1)

where x̄ and ȳ are the means of X and Y.
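The PWM similarity computation can be sketched as follows; the 4 × w matrix shape (rows A, C, G, T) and the example matrix are illustrative assumptions:

```python
import numpy as np

def pwm_pearson(pwm_a, pwm_b):
    # Flatten each 4 x w position weight matrix into a vector and
    # compute the Pearson correlation between the two vectors.
    x = np.asarray(pwm_a, dtype=float).ravel()
    y = np.asarray(pwm_b, dtype=float).ravel()
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

# Hypothetical 4 x 3 PWM (rows: A, C, G, T; columns: positions).
m1 = [[0.7, 0.1, 0.1],
      [0.1, 0.7, 0.1],
      [0.1, 0.1, 0.7],
      [0.1, 0.1, 0.1]]
r = pwm_pearson(m1, m1)  # identical motifs correlate perfectly
```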

S1.4 TFBM prediction
Here, we generate a catalog of TFBM predictions. To do so, we extract sequences with intrinsic properties that TFBSs hold (Figure 7D, main text). From these sequences, we generate TFBM predictions (Figure 7E, main text). In the following subsections we explain these steps in detail.

S1.4.1 Compilation of "DNA words" dictionaries
The TFBS motif generation relies on the fragmentation of the genome into sub-sequences, or windows, of w bases with potential regulatory meaning. First, we extract a list of DNA sub-sequences of w nucleotide base pairs that hold features intrinsic to TFBSs. We call this list B_w. We create lists with sequence lengths in the range w_min ≤ w ≤ w_max, where w_min = 4 and w_max = 16. To generate a list B_w, we employ the techniques outlined in Figure 7 of the main text. The objective of this step is to extract "DNA words" that are likely to be binding sites of TFs. This step does not focus on a word's location, only on the fact that the word exists.

S1.4.1.1 Conservation curve method
The base pairs shared among multiple aligned species create a pattern useful for filtering out spurious sequence noise. We call this pattern the conservation curve. To illustrate how the conservation curve is obtained, we focus on a particular window of w = 15 base pairs (Figure 7A, main text). Figure 7C of the main text shows the entropy curve for the Jaspar motif MA0142 (dark blue line). The entropy curve H is calculated for each column X_j of a PWM with the following formula:

H(X_j) = − Σ_i p_i log2 p_i    (2)

where p_i is the probability of occurrence of the base represented by cell x_i in the corresponding column X_j of the PWM. Additionally, we calculate the conservation curve (light blue in Figure 7C): for each position of the sub-sequence, we count how many times the position matches the rest of the species and normalize this value by dividing by the total number of species in the alignment. Note the low conservation of the high-entropy areas. The entropy is inverse to the conservation level. This relation does not imply that conservation must follow the same shape as the entropy, but it gave us the notion that the conservation curve is a useful fingerprint of a TFBS. To generate the list of "DNA words" B_w, we divide the set of TFs into two sets.
The validation set V holds only TFs with sites of w base pairs. The training set T contains all the other TFs. For each experimentally validated binding site of each TF in T, we find all the positions in the set of multiple aligned sequences SEQ where the site appears. Then, we calculate the conservation curve for each of them. Since the TFs from SEQ have different lengths, we normalize these vectors using polynomial splines (13) so that they match w. For each sub-sequence of length w in SEQ we compute its conservation curve. We find the closest conservation curve in the list of training conservation curves using a 1-NN similarity search (section S1.2). The collection S is a set of w-dimensional vectors and the similarity criterion is the L2 distance (10). For a pair of conservation vectors X = (x_1, …, x_w) and Y = (y_1, …, y_w), it is defined as:

L2(X, Y) = sqrt( Σ_{i=1}^{w} (x_i − y_i)² )

If their L2 distance is less than a threshold α, we consider the site a possible candidate "DNA word". This process creates a list of sub-sequences of length w whose conservation curves are similar to those of experimentally validated TFBSs. The parameter α was tuned to minimize the size of B_w while keeping the sites of each TF in V. Figure S1A displays the α values used for each B_w.

S1.4.1.2 Permutation prefix method
The order in which the multiple-aligned sequences of different species can be reorganized with respect to a "DNA word" at a specific location is useful for eliminating irrelevant "DNA words". To exploit this ordering we designed the permutation prefix method. A simple way to compare two sequences is to count the number of differing nucleotide bases; this count is the Hamming distance (10). For each sub-sequence in SEQ of length w, the corresponding aligned sequences are ordered by their Hamming distance. Sorting species by closeness in the alignment generates a permutation that is used as a fingerprint to further reduce the size of B_w.
We calculate the Hamming distance of the window we want to evaluate against all the corresponding windows in the other species. Then, we sort the sub-sequences by distance. When two distances are equal, a consistent tie-breaking order of the species is applied for all "DNA words". A permutation of 45 species will create a unique fingerprint per window in SEQ. This presents a problem, because we want common permutations in order to filter the windows appropriately. To address this issue, several researchers (14,15,16) have proposed taking only the top p elements of the fingerprint. We hypothesize that evolution occurs at different rates along the genome, which a general phylogenetic tree might not capture. Evolution does not necessarily work uniformly at a global scale, and local genomic regions may have different conservation levels among species (17). Furthermore, the multiple alignment heuristic employed may also introduce artifacts. Our method attempts to identify these artifacts so that, if they are common, they are handled appropriately. The permutation filter tries to compensate for these changes by looking locally at interesting targets. We take the training set T and extract the permutation fingerprint, but we only keep the top p closest species (see section S1.2). In the example of Figure 7B (main text), the first p = 8 species are selected, creating the following permutation prefix: [panTro2, gorGor1, ponAbe2, rheMac2, papHam1, otoGar1, tupBel1, camFam2]. The next step is to find the largest p such that the sites of the validation set V are preserved.

Table S1: Sizes of the unique "DNA words" in each B_w after applying the conservation curve, permutation prefix and merging filters. The size of each B_w is reduced considerably by merging the conservation curve and permutation prefix results.
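The permutation prefix extraction can be sketched as follows. Using the species name as the tie-breaker is one concrete choice of the "consistent way to order the species" mentioned above, and the aligned windows are illustrative:

```python
def hamming(a, b):
    # Number of positions where two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

def permutation_prefix(ref_window, species_windows, p):
    """Order the species by the Hamming distance of their aligned window
    to the reference window, breaking ties consistently by species name,
    and keep only the top p species as the fingerprint."""
    ranked = sorted(species_windows,
                    key=lambda sw: (hamming(ref_window, sw[1]), sw[0]))
    return [name for name, _ in ranked[:p]]

# Hypothetical aligned windows for three species.
aligned = [("panTro2", "ACTGGGGACT"),
           ("rheMac2", "ACTGGGAGCT"),
           ("tupBel1", "GGTGGGGCAA")]
prefix = permutation_prefix("ACTGGGGACT", aligned, p=2)
```

A "DNA word" is then kept only if some training fingerprint matches this prefix exactly.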
Following the conservation curve method of the previous section, we obtain the permutation fingerprint for every sub-sequence of length w in SEQ. We output a "DNA word" only if there is a permutation fingerprint in the training set that exactly matches its first p species. Figure S1B displays the p values used for each B_w. As w grows, the probability of finding equal prefixes diminishes, and therefore p must be reduced.

S1.4.1.3 Merge method
The final step of this phase (Figure 7D, main text) is to merge the results of the conservation curve method and the permutation prefix method, keeping only the elements common to both sets of "DNA words". Table S1 shows the resulting sizes for each B_w generated by the conservation curve method, the permutation prefix method and the merging method. Each reduced B_w allowed us to predict novel TFBMs. This reduction is crucial for increasing the quality of the results and for making it computationally feasible to generate the predictions.

S1.4.2 Completion of the TFBM map through clustering
Once we have a set B_w of "DNA words" with potential regulatory meaning, the next step is to extract biologically meaningful motifs. In this step we employ the entropy curve extracted from the motif's PWM (Eq. 2): for each column of the PWM we calculate its entropy. The outline of the procedure is as follows:
1. Assume that each "DNA word" in B_w is the center of a cluster of "DNA words" of size k. This cluster is generated by obtaining the k elements in B_w closest (measured with a conglomerate distance) to the word currently being evaluated.
2. Find the subset of elements that creates an entropy curve that best matches the entropy curves generated from the training set T.
3. Rank each cluster according to a quality criterion based on phylogenetic conservation and on the average entropy.
4. Filter out predictions that do not follow the general patterns of motifs using the dynamic dimension selection filter.
5. Remove clusters that share similar motifs and group similar clusters.

S1.4.2.1 Cluster computation based on conglomerate distance
For each "DNA word" in B_w we obtain the k closest elements in B_w. To find them, we employ the k-NN query of the similarity search techniques described in section S1.2. The collection is the set of "DNA words" B_w, and the distance function is described below. To find the closest sub-sequences, a similarity criterion is required. The Hamming distance described in section S1.4.1 could be employed; nevertheless, we obtained better results with a customized distance function, which is especially useful for large w. The distance function gives extra weight to clusters of similar bases. Table S2 shows six examples of our novel conglomerate distance. The conglomerate distance is smaller when a larger number of contiguous equal bases, or "chunks", is present. E.g., the first two cases have a Hamming distance of 5; however, the conglomerate distance returns a smaller value for the first example because its common elements are next to each other. The same holds true for examples 3 and 4: since example 4 has a larger contiguous segment, the conglomerate distance returns a smaller value than in example 3. Figure S2 displays the conglomerate distance algorithm. In line 10, we increment "counter", which holds the number of consecutive equal bases seen so far; in this way we keep track of equal "chunks". In line 16, we reset the counter and store in "score" the value calculated by the function "cc" (line 27), which measures how many other bases accompany the current base in the chunk. Finally, in line 24 we return the difference between the calculated score and the score of a perfect match. Given the conglomerate distance function and a set of sequences B_w, we extract the k closest sequences for each sequence in B_w. We used k = 20 in all the experiments.
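A compact sketch of the conglomerate distance, written to reproduce the worked examples of Table S2 (distance = w² minus the sum of squared lengths of each maximal run of matching bases); it is a re-derivation from those examples, not the algorithm of Figure S2 line by line:

```python
def conglomerate_distance(a, b):
    """w^2 minus the sum of squared lengths of each maximal run of
    matching bases, so contiguous "chunks" of agreement are rewarded."""
    assert len(a) == len(b)
    w = len(a)
    score, run = 0, 0
    for x, y in zip(a, b):
        if x == y:
            run += 1
        else:
            score += run * run   # close the current matching chunk
            run = 0
    score += run * run           # close the final chunk
    return w * w - score

d1 = conglomerate_distance("ACTGGGGACT", "ACTGGGAGCT")  # chunks 6 and 2: 100 - 36 - 4 = 60
d2 = conglomerate_distance("ACTGGGGACT", "ACTGGGGACA")  # chunk of 9: 100 - 81 = 19
```

Note how a single mismatch splitting a long run (d2) costs far less than the same number of mismatches scattered across the word, which is exactly the bias toward contiguous agreement described above.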
S1.4.2.2 Sequence cluster fitting heuristic based on entropy curve
After obtaining "DNA word" clusters of size k, we must take the subset of these sequences that best matches the entropy curves (Eq. 2) of known TFs. We employ the training set T of TFs to remove elements of each cluster that do not contribute to an entropy curve that follows the pattern of known TFs. Figures 7C and E (main text) show natural patterns that occur in known TFBMs. The algorithm tries to choose sequences that match already known entropy patterns. Figure S3 shows the cluster fitting heuristic. The heuristic attempts to find the best possible fit for each training TF (line 4).
In line 12, we measure the variation of the Euclidean distance of the "fitted" cluster as we incrementally add each sequence in "input". We keep adding sequences as long as the score keeps decreasing; as soon as this is no longer the case, we stop the loop and store a partial solution (line 20). Finally, in line 22 we return the fitted cluster with the best score.

Table S2: Examples of the conglomerate distance against the reference sequence "ACTGGGGACT" (compared sequence, Hamming distance, conglomerate distance):
"GGTGGGGCAA": Hamming 5, conglomerate 75 = 10^2 − 5^2
"ACTGGGAGCT": Hamming 2, conglomerate 60 = 10^2 − 6^2 − 2^2
"ACTGGGGACA": Hamming 1, conglomerate 19 = 10^2 − 9^2
"ACTGGGGACT": Hamming 0, conglomerate 0 = 10^2 − 10^2

S1.4.2.3 Cluster prediction ranking
Once the clusters are fitted, they are ranked with a quality criterion. In section S1.4.1, we keep for each "DNA word" of length w the number of times the "DNA word" is conserved, c, and its total number of appearances, t. With the recorded c_i and t_i of each site i of the cluster, we calculate the phylogenetic conservation score by dividing the total number of conserved instances by the total number of instances: score = Σ c_i / Σ t_i. This score is similar to the one implemented by Xie et al. (11). Additionally, to deal with large w, for which sequences may appear only once, we define a "cscore" by normalizing the total number of appearances by the maximum number of appearances max(t) registered. We add the phylogenetic conservation score and the cscore to create a ranking value.
Higher values of the phylogenetic conservation score represent a better score. Finally, when the score is the same, the order is decided by the average entropy:

H̄(X) = (1/w) Σ_{j=1}^{w} H(X_j)

where H is the entropy (Eq. 2) and X is the PWM with w columns.

S1.4.2.4 Sequence cluster filtering based on dynamic dimension selection (DDS)
The ranking step of the previous section generates a large number of potential TFBMs. We create a feature set for each potential cluster and then employ a filter to remove clusters that do not possess certain properties intrinsic to TFs. The feature set consists of the following properties (the complete list of six features appears in Table 2 of the main text; items 1–4 are scalar features):
5. Entropy curve (normalized to w components).
6. Distance distribution: the histogram of distances between all the sites in the cluster, including multiple appearances of the same site (normalized to w components).
This ensemble generates a feature vector of (2w + 4) components per cluster (items 5 and 6 contain w features each, plus the first four scalar items). Since the training set T for sequence length w is formed of sequences with lengths w_T ≠ w (with w_min ≤ w_T ≤ w_max), we rescale features 5 and 6 of the training set to length w using a polynomial spline (13) approach. Next, we train the dynamic dimension selection (DDS) method, which filters out clusters that do not follow intrinsic patterns of TFs. One approach would be to employ an L_p metric (10) to find similar features based on a distance threshold; another would be a one-class support vector machine. These two approaches did not filter out a sufficient number of clusters, so we implemented a novel filtering technique. The proposed DDS filtering technique is loosely based on the ideas of Tung et al. (18) and Aggarwal et al. (19). These authors recognized the need to take into account a subset of the feature vector to compensate for large differences in magnitude between components (even when the magnitudes have been normalized).
An illustrative example was presented in Tung et al. (18). Consider the following 10-dimensional query object: Q = (1,1,1,1,1,1,1,1,1,1). A nearest-neighbor search over Table S3 based on the Euclidean distance will return object 4. Nevertheless, if we consider the other three objects, each of their components is closer except for those with a value of 100. Tung et al. (18) and Aggarwal and Yu (19) tackled this problem by focusing only on components with high similarity, but in the end their algorithms merge the distances into a single final value. The six different types of features (encoded in a (2w + 4)-dimensional vector, Table 2 of the main text) correspond to properties of a different nature and thus are not directly comparable. We propose a filter that takes into account the insights of Tung et al. (18) and Aggarwal and Yu (19) while avoiding mixing different dimensions. Thus, normalization is not required and the notion of a distance function no longer applies.
DDS receives a feature vector; a parameter dimThreshold that indicates the minimum number of components needed to output a match; a parameter iterThreshold that indicates the maximum number of iterations the algorithm may perform before it reaches the minimum set of components; and a parameter hamThreshold used in a final filtering step. Given a d-dimensional query object Q = (q_1, …, q_d), for each dimension q_i DDS orders the objects in the database according to their distance to q_i. In the Table S3 example, for q_1 = 1, the object ID list would be: ((3), (1), (2), (4)) (the parentheses enclose a list of elements). There can be cases where a component value is the same for a number of objects; in those cases the objects are returned in one list, e.g., for q_7 the returned value would be: ((1, 2), (3), (4)). These sorted results L = (l_1, …, l_m) (where m is the number of objects in the dataset) are iterated one list l_j at a time. Figure S4 shows the complete pseudo-code of the DDS algorithm. In line 11 we perform this first matching operation. Next, we incrementally record the object IDs found in each individual component (line 21). As soon as we find an object ID that appears at least dimThreshold times, we stop the search (line 25). If this break condition occurs, then j ≤ iterThreshold holds (line 30); if we exceed the number of iterations iterThreshold, the condition in line 30 returns false. We have added an additional filter in this line: we store the components used to validate the training data and consider this set of components a binary vector in which the positions used to accept the feature vector are set to 1.

Table S4: Parameters used by the DDS filter for each word length w (three parameter values per w):
w = 4: 18, 4, 1; w = 5: 347, 15, 3; w = 6: 104, 10, 1; w = 7: 204, 17, 2; w = 8: 215, 17, 2; w = 9: 87, 13, 1; w = 10: 80, 17, 0; w = 11: 95, 20, 0; w = 12: 53, 19, 0; w = 13: 123, 24, 0; w = 14: 108, 26, 0; w = 15: 291, 31, 0; w = 16: 70, 20, 0.
Given dimThreshold and iterThreshold, we precompute a binary vector B for all the objects in the database (with the object in question removed). The function hammingFilter finds the closest mask in B for each query Q. We accept the result only if the Hamming distance between the closest vector and the query is ≤ hamThreshold. The process of finding masks also helps to find suitable iterThreshold and hamThreshold parameters; it is only necessary to choose an appropriate dimThreshold. Table S4 shows the parameters used to filter our predictions for different w. DDS is a powerful filter that reduces the cluster predictions to a fraction. Normalization steps would be required for the different items if one wanted to employ techniques such as a one-class support vector machine (20); our method, in contrast, treats each dimension independently and therefore requires no normalization. Instead of dealing with a range of values that mixes different components, we only use iterThreshold as a substitute. Due to this ability to reduce the number of predictions (Table S5), and the fact that B_w was already reduced substantially by the filters designed to compile the "DNA words" (Table S1), the method produced a manageable number of TFBM predictions.

S1.4.2.5 Similar motif removal and cluster merging
The final step in the TFBM prediction is the removal of clusters in which 80% of the sites overlap; of each such pair, we remove the cluster with the lower ranking score. Finally, we create groups of clusters that share Pearson similarities above 0.85. With this step the prediction of TFBMs is complete. The output is a list of groups of "DNA words" clusters that are potential TFBMs.
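The per-dimension matching loop at the heart of DDS can be sketched as follows. This is a simplified model, not the full algorithm of Figure S4: ties within a dimension and the final Hamming mask filter are omitted:

```python
import numpy as np

def dds_match(query, database, dim_threshold, iter_threshold):
    """For each dimension, rank the database objects by distance to the
    query component; walk the rankings one position at a time and accept
    the first object seen in at least dim_threshold dimensions within
    iter_threshold ranking positions. Returns the object index or None."""
    d = len(query)
    # ranked[i] = object indices sorted by |database[:, i] - query[i]|
    ranked = [np.argsort(np.abs(database[:, i] - query[i])) for i in range(d)]
    counts = {}
    for j in range(min(iter_threshold, len(database))):
        for i in range(d):
            obj = int(ranked[i][j])
            counts[obj] = counts.get(obj, 0) + 1
            if counts[obj] >= dim_threshold:
                return obj          # matched in enough dimensions
    return None                     # iteration budget exhausted: no match

db = np.array([[1.0,   1.0, 100.0],
               [1.1, 100.0,   1.0],
               [5.0,   5.0,   5.0]])
hit = dds_match(np.array([1.0, 1.0, 1.0]), db, dim_threshold=2, iter_threshold=2)
```

Because each dimension is ranked independently, components of different nature are never mixed into a single distance value, which is the property the section argues for.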

S1.4.3 Predicted motif validation and significance analysis
Given the list of TFs extracted from the Jaspar and Transfac databases (section S1.1), we calculate how many of these motifs are actually predicted by our algorithm. We employ the Pearson correlation (section S1.3) with a threshold of 0.85 to accept a prediction as successful. We performed a total of 509 validations against experimental data, and 416 had a Pearson correlation greater than 0.85. We also generated a set of random control PWMs to estimate the probability that our motif predictions occur by chance. For the Jaspar and Transfac PWMs we shuffled the elements of each column while preserving the column order; this accounts for nucleotide compositional biases (11). For each of our predictions, we calculated how many reached a similarity of 0.85 or more against this random dataset. Only 1% of our predictions matched this control dataset.
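The column-shuffling control can be sketched as follows; the example PWM is illustrative:

```python
import numpy as np

def shuffled_control(pwm, seed=0):
    """Random control PWM: shuffle the elements within each column while
    preserving the column order, so per-position composition is kept."""
    rng = np.random.default_rng(seed)
    pwm = np.asarray(pwm, dtype=float)
    out = pwm.copy()
    for j in range(pwm.shape[1]):
        out[:, j] = rng.permutation(pwm[:, j])
    return out

pwm = np.array([[0.7, 0.1],
                [0.1, 0.7],
                [0.1, 0.1],
                [0.1, 0.1]])
ctrl = shuffled_control(pwm)
# Each column keeps the same multiset of values, only their row order changes.
```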

S1.4.4 TFBS prediction
Given each B_w previously generated (section S1.4.1), we search the upstream5000 dataset (multiple sequence alignments of promoters of 5000 base pairs) by only looking at windows belonging to B_w. Since each such sequence holds features intrinsic to TFBSs, we concentrate only on this "distilled" version of the promoter regions. Based on the Berg–von Hippel method (21), we proceed as in Sarkar et al. (22): the binding energy m of a motif with PWM σ to a sequence X = (x_1, …, x_w) is calculated by:

m = Σ_{i=1}^{w} ln[ P(x_i | σ_i) / P(x_i | σ′) ]

where σ′ is the background frequency of the four nucleotides across B_w and P(x_i | σ_i) denotes the normalized frequency of nucleotide x_i in column σ_i of the position weight matrix. By obtaining the binding energy m for each sequence in B_w, we calculate a binding-energy z-score for each motif. To achieve this, we compute the distribution of binding energies of a motif against the entire B_w. We only accept motifs with an energy z-score of at least 2.0. Additionally, we stored the average number of matching species per base for each sequence.
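The scoring and z-score steps can be sketched as follows. The log-likelihood form is one common reading of Berg–von Hippel-style scoring, and the tiny PWM and word list are illustrative:

```python
import math

import numpy as np

BASES = "ACGT"

def binding_energy(seq, pwm, background):
    # Log-likelihood score of a sequence under the PWM relative to the
    # background nucleotide frequencies (assumed form of the energy m).
    return sum(math.log(pwm[BASES.index(b)][i] / background[BASES.index(b)])
               for i, b in enumerate(seq))

def energy_zscore(seq, pwm, background, words):
    # z-score of one sequence's energy against the energy distribution
    # computed over the whole word list B_w.
    energies = np.array([binding_energy(w, pwm, background) for w in words])
    return float((binding_energy(seq, pwm, background) - energies.mean())
                 / energies.std())

# Hypothetical 4 x 2 PWM (rows A, C, G, T) and uniform background.
pwm = [[0.5, 0.5], [0.2, 0.2], [0.2, 0.2], [0.1, 0.1]]
bg = [0.25, 0.25, 0.25, 0.25]
m = binding_energy("AA", pwm, bg)
z = energy_zscore("AA", pwm, bg, ["AA", "CC", "GT"])
```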

S1.5 Combinatorial binding pattern generation
Once the catalog of individual TFBMs is obtained, we find how they group with each other, forming combinatorial binding patterns (CBPs). The CBP algorithm searches for similar motif sets and then tries to align them, taking into account the following parameters:
• minSize: minimum number of motifs that must appear in the CBP.
• t: maximum number of base pairs between motifs.
We define a "group" as a set of motifs each separated by no more than t base pairs. The first step of the CBP generation algorithm is to create these groups. Figure S5 shows the pseudo-code of the CBP generation algorithm:

1  // CBP generation
2  // g: base group used to generate a new CBP
3  // gs: complete list of groups
4  Group generateCBP(Group g, List<Group> gs,
5                    int minSize, int minInstances) {
6    // get groups that share at least minSize motifs.
7    Group result = g; // result set
8    for (Group sg : similarGroups(g, gs, minSize)) {
9      // find common elements
10     Group resultt = sg ∩ result;
11     // make sure result >= minInstances
12     if (calculateInstances(resultt) < minInstances)

For each extracted group g, the algorithm starts by searching for groups that share at least minSize motifs (line 8). Once this candidate list is found, it is ordered by the amount of similarity shared with g by the similarGroups function; in this case, similarity is the number of shared motifs, and higher similarity is better (we break away from the traditional definition of distance). The algorithm greedily adds candidates while minInstances holds true (line 12). At each step we gather the common motifs into the "result" variable (line 10). Next, we perform a multiple alignment using the center star algorithm (23).
This algorithm finds a center group and then uses it to compute pairwise alignments with the other sequences, adding spaces as needed. The pairwise alignment is calculated with the Smith–Waterman (24) aligner. We set the difference cost to 5, and the open and extended gap penalties to 0. Motifs must align in at least minThickness groups to be considered. After the alignment, we check whether the generated CBP still satisfies minSize and minThickness. Thus, we generate CBPs with topologically aligned motifs. We employed the following parameters: t = 1000, minSize = 3, minInstances = 3, minThickness = 3.
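The greedy intersection step of the CBP generation loop can be sketched as follows. This is a simplified model: a group is reduced to a set of motif IDs, and calculateInstances is approximated by the number of groups merged so far — both simplifications are assumptions:

```python
def generate_cbp(g, groups, min_size, min_instances):
    """Greedily intersect the base group with its most similar groups,
    keeping the shared motifs while at least min_size motifs remain."""
    # Candidates sharing at least min_size motifs, most similar first
    # (similarity = number of shared motifs, higher is better).
    candidates = sorted(
        (sg for sg in groups if sg is not g and len(g & sg) >= min_size),
        key=lambda sg: len(g & sg), reverse=True)
    result, instances = set(g), 1
    for sg in candidates:
        shared = result & sg
        if len(shared) < min_size:
            break                    # intersecting further would shrink too much
        result, instances = shared, instances + 1
    return result if instances >= min_instances else None

groups = [{"m1", "m2", "m3", "m4"},
          {"m1", "m2", "m3", "m5"},
          {"m1", "m2", "m3"},
          {"m7", "m8", "m9"}]
cbp = generate_cbp(groups[0], groups, min_size=3, min_instances=3)
```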

S1.5.1 CBP visualization
We created a CBP visualization program that shows an abstract representation of the motif alignment in the different promoters. We employed Graphviz (25) and WebLogo (26) to represent our alignments. Each CBP figure contains a list of metrics that summarize relevant characteristics of the CBP:
• COV (coverage): The average number of base pairs per group that are available for binding.
• GEN (generality): Number of genes where the pattern holds true.
• ROB (robustness): The average number of times a motif is repeated within the pattern.
• SIG (significance): The z-score of the actual GEN within the distribution of GEN values obtained when the promoters are randomly shuffled (section S1.5.2).
• AL: The multiple alignment score, higher values represent better alignment. The cost matrix holds the same value, 5, in each cell.
In our visualizations, each arrow connecting two motifs a and b means that a motif b is to the "right" of motif a. The thickness of the arrow measures the percentage of promoters where this relationship occurs. The gradient "D" represents the number of base pairs that separate each binding site.

S1.5.2 CBP significance analysis
To measure whether our CBPs could occur at random, we iteratively apply the following procedure: 1. Shuffle all the motif groups.
2. For each CBP, count how many promoters still hold the motifs of the CBP in the randomized group set. We store this value so that we can generate a z-score during the final stage.
For each CBP we generate a z value based on the number of promoters where the CBP is present in upstream5000 with respect to its appearance in the random promoter sets.
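The shuffle-based significance test can be sketched as follows. Representing each promoter's motif group as a list of motif IDs and shuffling the pooled motifs while preserving group sizes is an assumed reading of "shuffle all the motif groups":

```python
import random

import numpy as np

def cbp_zscore(cbp_motifs, promoter_groups, n_rounds=200, seed=0):
    """z-score of the observed number of promoters containing all of a
    CBP's motifs, against the same count under shuffled motif groups."""
    rng = random.Random(seed)

    def count(groups):
        return sum(cbp_motifs <= set(g) for g in groups)

    observed = count(promoter_groups)
    pool = [m for g in promoter_groups for m in g]
    sizes = [len(g) for g in promoter_groups]
    null = []
    for _ in range(n_rounds):
        rng.shuffle(pool)
        shuffled, i = [], 0
        for s in sizes:             # reassign motifs, keeping group sizes
            shuffled.append(pool[i:i + s])
            i += s
        null.append(count(shuffled))
    null = np.array(null, dtype=float)
    std = null.std()
    return float((observed - null.mean()) / (std if std > 0 else 1.0))

cbp = {"a", "b"}
groups = [["a", "b", "c"], ["a", "b"], ["c", "d"], ["d", "e"]]
z = cbp_zscore(cbp, groups)
```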

S1.5.3 Predictions with CBP significance
We take a subset of the motif predictions generated in section S1.4 based on the CBP significance score previously calculated. We take only those predictions that appear in CBPs with a SIG z value of 3.0 or greater. The set of motif predictions with CBP significance is denoted by STFBM.

S1.6 Computer program implementation
All the algorithms of sections S1.4 and S1.5 were implemented in Java 1.6. The visualizations of section S1.5 were implemented in Clojure 1.2. Gene ontology enrichment statistical significance analysis was developed in Matlab (R2009b).
To reduce the long computing times, we used a range of machines, from a 120-core AMD Opteron 2.6 GHz cluster with a total of 160 GB of RAM to a 512-core Itanium 2 1.6 GHz cluster with a total of 1.4 TB of RAM.

S2.1 TFBM and CBP significant ontologies representation in ontology maps
The designed ontology maps are projections of the high-dimensional space of STFBMs and CBPs and their respective statistically significantly enriched ontologies. Figure S6 shows cellular component ontologies for the CBPs. Figures S7, S8 and S9 show the molecular function, biological process and cellular component ontologies, respectively, for the set of TFBMs that are embedded in the CBP set (denoted STFBMs). Each CBP or STFBM is linked to its corresponding set of significant molecular function ontology terms. CBPs or STFBMs that share common ontology terms are also linked together. Regions with the same color represent the same cluster of objects. Borders of the "countries" outline the boundaries of CBP or STFBM clusters. The ontology term font size is adjusted to reflect the frequency with which the term is associated with the CBPs or STFBMs.
As in the case of the CBPs, for the STFBMs there is a strong transcription-related component in the lower left region of Figure S7. Similarly, in Figure S8, development-related ontologies are strong in the center-left region of the figure.

Table S6: Predicted TFBMs and the corresponding validations that matched a known binding motif in the Transfac or Jaspar databases. The values inside parentheses represent the correlation for the case in which the match is performed with a motif inside the cluster that is not the top element of the cluster. The ID of the motif is shown alongside the binding motif logo for our prediction and the known motif. We denote by ρ the Pearson correlation between our prediction and the known motif. The Jaspar and Transfac motifs are plotted using the original sites. The Pearson correlation is calculated over the motifs that can be created from the original "DNA word" dictionary. The motifs are sorted by motif length.

Table S8: Ab initio predicted motifs that appear frequently in CBPs (STFBMs). The position within this subset is included alongside the prediction ID, the motif logo, the significance z-score and the corresponding CBP ID. Additionally, the most significant molecular function, biological process and cellular component ontologies are included for each motif.


S2.5 CBP gallery
We selected a few CBPs for display in this gallery according to the following criteria:
• The CBP applies to more than 10 genes.
• SIG is greater than 3.0.
• At least one of our ab initio TFBM predictions is included in the CBP.
If a CBP has significant ontologies, they are listed in the rows that follow it. Section S1.5.1 explains how to interpret the visualizations.

Table S9: CBP Gallery.

Table S10: Set of data files provided in our catalog. There are several types of data sources: the type "files" denotes a set of files, the type "file" refers to a single file, and the type "dir" is a directory of files. Each individual file is an entry in the dataset.
Type   Name                        Description
dir    Motif-Predictions           Includes the list of all predicted motifs. Each motif filename has the form "b-$w-$n-$s", where w is the length of the motif, n is the motif group and s is the identification within the given motif group.
files  motif-locations-$w.csv      Genomic locations where motifs from Jaspar, Transfac and our predictions bind. These files were used for CBP generation.
files  predictions-summary-$w.csv  The main component of each prediction group is recorded in each line of this file.
files  predictions-full-$w.csv     Each prediction group is displayed; the entries of the group are stored in each line of this file.
file   CBP-Visualization.zip       Every entry is of the form "c-$id.eps", where id is the CBP identification number.
file   cbps-summary.csv            Summary of CBPs found (motifs and genes).
• Local conservation z-score: The conservation of this window with respect to other locations. NaN if the window exists only in one genome location.
• Global conservation z-score: The number of standard deviations that the conservation departs from the mean conservation of all windows in B w .
• TF binding energy z-score.
• TF name. Our predictions start with "b-", Jaspar motifs start with "MA-", and the remaining predictions, whose identifiers contain a "$" symbol, are from the Transfac database.
• Standard gene name.
• Position inside the promoter. Position 0 is immediately before the transcription start site.
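The rows of these files can be consumed with a small reader. The sketch below is a hypothetical example: the field order mirrors the bullet list above (it is our assumption, not a documented header), and the helper names `read_motif_locations` and `motif_source` are ours. The source classification follows the naming conventions stated for TF names.

```python
import csv

# Hypothetical reader for a motif-locations-$w.csv file. The column order
# below is assumed to match the bullet list in the text; adjust if the
# actual files differ.
FIELDS = ["local_conservation_z", "global_conservation_z",
          "binding_energy_z", "tf_name", "gene_name", "promoter_position"]

def read_motif_locations(path):
    """Yield one dict per binding-site row of a motif-locations file."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            yield dict(zip(FIELDS, row))

def motif_source(tf_name):
    """Classify a TF identifier by the naming conventions described above."""
    if tf_name.startswith("b-"):
        return "prediction"      # ab initio prediction from this work
    if tf_name.startswith("MA-"):
        return "Jaspar"
    if "$" in tf_name:
        return "Transfac"
    return "unknown"
```

For example, `motif_source("b-8-3-1")` returns `"prediction"`, while an identifier containing a `$` symbol is attributed to Transfac.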

S3.3 Files: Predictions-Summary/predictions-summary-$w.csv
Each line in this file contains the highest ranking element of the motif prediction groups. The first column contains the row id. The second column holds the motif prediction identifier. This identifier is of the form: "b-$w-$n-$s", where w is the length of the motif, n is the motif group, and s is the identification within the given motif group.
The third column contains one or more Pearson correlation results for the matching motifs in the fourth column.
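The "b-$w-$n-$s" identifier scheme can be unpacked mechanically. This is a minimal sketch under the format stated above; the function name and field labels are our own.

```python
import re

# Matches the prediction identifier format "b-$w-$n-$s" described in the
# text: w = motif length, n = motif group, s = id within the group.
ID_PATTERN = re.compile(r"^b-(?P<w>\d+)-(?P<n>\d+)-(?P<s>\d+)$")

def parse_prediction_id(identifier):
    """Return (motif length, motif group, id within group) as integers."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError("not a prediction identifier: " + identifier)
    return int(match.group("w")), int(match.group("n")), int(match.group("s"))
```

For instance, `parse_prediction_id("b-8-12-3")` yields `(8, 12, 3)`: a length-8 motif, group 12, third member of the group.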

S3.4 Files: Predictions-Full/predictions-full-$w.csv
The format of this file is similar to "predictions-summary-$w.csv", except that all the entries of the groups are displayed here.

S3.5 File: CBP-Visualization.zip
This compressed file contains PDF files with all the CBPs found. Every entry is of the form "c-$id.pdf", where id is the CBP identification number.

S3.6 File: CBP/cbps-summary.csv
Each line of this dataset summarizes one CBP. The columns contain the following fields:
• CBP ID: CBP identification number.
• Motif counts: Number of motifs included in the rule.
• Score: Sum of the alignment scores against the base sequence.
• Significance: The z-score of the actual GEN in the distribution of GEN that would be generated if the promoters are randomly shuffled (section S1.5.2).
• Generality: Number of genes in the CBP.
• Coverage: Number of bases where motifs are binding.
• Robustness: Number of times an individual motif is duplicated in the CBP.
• Motif names: One or more columns holding the motif names of the CBP.
• "|": Separation token. This character divides the previous list from the next list.
• Gene names: Genes where this CBP applies.
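Because the motif-name and gene-name lists are variable-length and divided by the "|" token, a row of cbps-summary.csv can be split as sketched below. The fixed-field positions follow the column list above and are an assumption; the function name `parse_cbp_row` is ours.

```python
# Minimal sketch: split one parsed cbps-summary.csv row into its fixed
# metadata fields, the motif-name list, and the gene-name list. Assumes the
# seven fixed columns appear in the order listed in the text, followed by
# motif names, the "|" token, and gene names.
def parse_cbp_row(columns):
    """Return (metadata dict, motif names, gene names) for one CSV row."""
    sep = columns.index("|")  # position of the separation token
    meta = {
        "cbp_id": columns[0],
        "motif_count": int(columns[1]),
        "score": float(columns[2]),
        "significance": float(columns[3]),
        "generality": int(columns[4]),   # number of genes in the CBP
        "coverage": int(columns[5]),
        "robustness": int(columns[6]),
    }
    motifs = columns[7:sep]
    genes = columns[sep + 1:]
    return meta, motifs, genes
```

Splitting at the token, rather than at a fixed column, keeps the reader correct for CBPs with any number of motifs.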