
Analyzed the data: TM. Wrote the paper: TM. Helped to write and proofread the paper: HR MV.

The authors have declared that no competing interests exist.

Recent experimental and theoretical efforts have highlighted the fact that binding of transcription factors to DNA can be more accurately described by continuous measures of their binding affinities, rather than a discrete description in terms of binding sites. While the binding affinities can be predicted from a physical model, it is often desirable to know the distribution of binding affinities for specific sequence backgrounds. In this paper, we present a statistical approach to derive the exact distribution for sequence models with fixed GC content. We demonstrate that the affinity distribution of almost all known transcription factors can be effectively parametrized by a class of generalized extreme value distributions. Moreover, this parameterization also describes the affinity distribution for sequence backgrounds with variable GC content, such as human promoter sequences. Our approach is applicable to arbitrary sequences and all transcription factors with known binding preferences that can be described in terms of a motif matrix. The statistical treatment also provides a proper framework to directly compare transcription factors with very different affinity distributions. This is illustrated by our analysis of human promoters with known binding sites, for many of which we could identify the known regulators as those with the highest affinity. The combination of physical model and statistical normalization provides a quantitative measure which ranks transcription factors for a given sequence, and which can be compared directly with large-scale binding data. Its successful application to human promoter sequences serves as an encouraging example of how the method can be applied to other sequences.

The binding of proteins to DNA is a key molecular mechanism, which can regulate the expression of genes in response to different cellular and environmental conditions. The extensive research on gene regulation has generated binding models for many transcription factors, but the prediction of new binding sites is still challenging and difficult to improve in any systematic way. Recent experimental advances, notably high throughput binding assays, have shifted the theoretical focus from the prediction of new binding sites towards more quantitative models for the binding affinities of transcription factors, which can now be measured across whole genomes. Therefore we have developed a biophysical model which accounts for much of the observed variation in binding strength. Here we extend this framework to model not just the binding affinity, but also its distribution in various sequence backgrounds. This enables us to compare predicted affinities from different transcription factors, and to rank them according to their normalized affinity. What are the biological implications of such a ranking? We have demonstrated that many known associations between transcription factors and their respective targets appear as strong interactions. This provides a rationale to predict, for any given promoter region, those transcription factors which are most likely to be involved in its regulation.

Several experimental advances in the study of gene regulation have highlighted the fact that transcription factors have a certain affinity to all DNA regions, as evidenced by many experimental techniques, such as DNAse footprinting

We have recently shown that such a threshold is not necessary to understand and to quantitatively model a large amount of binding data from ChIP-chip experiments in yeast

While the TRAP model allows one to rank sequence regions according to their affinities for a given transcription factor, it cannot always be applied to compare different transcription factors for a given sequence. This is because different transcription factors can have very different specificities, i.e. different distributions of affinities. In this paper we aim to remedy this situation by providing a proper normalisation, such that the binding affinities of different factors can be directly compared with each other. To this end we define a statistical score (p-value), which gives the probability of observing a certain affinity or higher in a given sequence background. Here the goal is not to set some significance threshold, but rather to normalise an observed affinity in the light of a random sequence model, and to give a statistical meaning to the statement that one factor binds more strongly than another.

In Section (2), we briefly review the TRAP model and introduce our notation. In Section (3) we derive the exact affinity distribution for an arbitrary motif matrix and for sequence backgrounds where all nucleotides are drawn independently from the same distribution (iid). We then show that this distribution can, to a large extent, be parametrized by an extreme value distribution, and that this effective characterisation can also account for the affinity distribution in non-iid sequences with variable GC-content. We also compiled a complete parametrization for 762 TRANSFAC matrices, which can be used for promoter regions of variable length without having to repeat the statistical modeling. To highlight the biological relevance of our approach, we present a realistic application of our method to human promoter regions. We show that many known regulatory interactions can be inferred from the high affinity of the associated transcription factor to the relevant promoter region.

For many transcription factors, motif matrices have been constructed from alignments of known binding sites. Here we rely on the curated results and matrix descriptions provided by the TRANSFAC database _{w,}_{α}) records how frequently a nucleotide α has been observed at position

In our earlier work _{l}_{l}

The left-hand side illustrates how a given motif matrix (

Here _{0} is a positive, sequence-independent parameter, and _{l}_{l}_{l}_{w,}_{α}) be a

Here α refers to the actual nucleotide of the sequence at position _{w,max}_{w}_{l}

From the above ingredients we determine the expected number of transcription factors bound to a longer DNA sequence region of length _{l}

For a more detailed exposition we refer the reader to our earlier work _{0}, were determined from large-scale ChIP-chip data with many different factors and cellular conditions. Importantly, our earlier work also revealed that they could be determined simultaneously for all transcription factors, and need not be tuned individually. The general TRAP model is then defined by λ = 0.7, and ln(_{0}) = 0.585*_{0}) also depends logarithmically on the transcription factor concentration, but this dependence is much weaker than the linear dependence on the motif width,

Given the general TRAP-model with fixed parameters, the affinity of a transcription factor to a specified sequence region depends only on the sequence composition and the matrix description of the factor. For simple sequence models, it should therefore be possible to calculate the affinity distribution for any matrix exactly. For simplicity we assume a sequence background with a given GC-content, i.e. a given single-nucleotide distribution (π_{A}_{C}_{G}_{T}_{w}_{E}

It follows from the convolution theorem that the Fourier transformation of _{E}

Here the final step denotes the inverse Fourier transformation to revert back to the original representation. This derivation is completely analogous to the approach by Staden _{E}_{a}

As before we consider the Fourier representation of _{a}

Now it is straightforward to derive also the distribution of total affinities for a sequence region with length

In practice we use a fast Fourier transform (FFT) to evaluate all of the above integrals numerically.
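The numerical step can be illustrated by a minimal sketch, which is not a description of our actual implementation. It assumes that the single-site contributions have been binned onto an evenly spaced grid, and it obtains the distribution of a sum of independent, identically distributed terms by raising the discrete Fourier transform of the single-term distribution to the corresponding power. The function name `sum_distribution` and the toy probabilities are hypothetical.

```python
import numpy as np

def sum_distribution(pmf, n_terms):
    """Distribution of the sum of n_terms iid variables with the given pmf.

    pmf[k] is the probability that a single term equals k * dx on an evenly
    spaced grid (an assumption of this sketch). The result has
    n_terms * (len(pmf) - 1) + 1 entries on the same grid.
    """
    pmf = np.asarray(pmf, dtype=float)
    size = n_terms * (len(pmf) - 1) + 1       # support of the n-fold sum
    # Zero-padding to the full support avoids circular wrap-around.
    phi = np.fft.rfft(pmf, n=size)            # discrete characteristic function
    total = np.fft.irfft(phi ** n_terms, n=size)
    total = np.clip(total, 0.0, None)         # remove tiny negative round-off
    return total / total.sum()

# Toy single-term distribution on three grid points (hypothetical values).
single = np.array([0.7, 0.2, 0.1])
fft_result = sum_distribution(single, 50)

# Reference: direct repeated convolution in the original representation.
direct = np.array([1.0])
for _ in range(50):
    direct = np.convolve(direct, single)
```

The FFT route needs one forward and one inverse transform regardless of the number of terms, whereas the direct route needs one convolution per term. On an evenly spaced grid, however, very small values are only resolved up to floating-point round-off.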

For a simple sequence model of identical and independently distributed nucleotides (iid), it is possible to calculate the exact distribution of affinities as described in the

Here we compare the empirical distribution (blue circles) of affinities calculated on 100,000 random sequences of length

Notice that our theoretical results completely agree with the empirical distribution of calculated affinities for a random background sequence with fixed GC-content. The figure also illustrates that, in general, the distribution is not easily parametrized and certainly not normally distributed. A similar point has been made previously in the context of score distributions

The reader should be reminded that for the above derivation we have assumed that all contributions to the total affinity are independent of each other. This is consistent with the assumptions of the physical model, Eq. (3) and Eq. (4). There are several matrices, for which this assumption does not hold as they possess a high degree of self-similarity, e.g.

While it is satisfying to obtain a theoretical expression for the exact distribution of affinities, this is not particularly convenient for practical purposes, as the full distribution function would have to be stored for different region sizes, L, and different GC-contents. Moreover, ultimately we will seek to model promoter sequences which are not iid, but tend to have highly variable GC-content. Therefore we now search for a convenient parametrization, which can be used efficiently in practice. We recall the explicit goal of this project, which is to provide a proper normalization of binding affinities, such that different binding factors can be compared directly with each other. From the previous section it is apparent that a simple parametrization will not be possible in general. The best one can hope to achieve is an effective parametrization, which is indistinguishable from the empirical distribution at some level of accuracy. In this section we will compare the ability of several standard parametrizations to model the distribution of 1000 affinities for several different sequence backgrounds.

To quantify the overall “failure rate” of a parametrization, we determined the number of matrices for which the Kolmogorov-Smirnov test discriminates between the parametrization and the empirical distribution at the level of P_{KS}<0.05. Notice that this choice of threshold is arbitrary and that we do not consider the P_{KS} to be proper p-values, as each parametrization was obtained from a best fit to the data. However, we applied the same procedure consistently to all setups, which still allows us to compare the relative performance of different parametric models for different sequence backgrounds.
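The counting procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for our analysis. It fits a log-normal distribution by matching the mean and standard deviation of the log-values, computes the Kolmogorov-Smirnov statistic, and converts it to a score with the standard asymptotic series. All function names and the two synthetic samples are hypothetical. As noted above, the resulting score is not a proper p-value because the parameters are fitted to the same data.

```python
import math
import numpy as np

def lognormal_cdf(x, mu, sigma):
    """CDF of a log-normal distribution with log-mean mu and log-sd sigma."""
    z = (np.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF and a model CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    f = cdf(x)
    upper = np.arange(1, n + 1) / n - f
    lower = f - np.arange(0, n) / n
    return max(upper.max(), lower.max())

def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov tail probability (series approximation)."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:                      # series converges slowly; value is ~1
        return 1.0
    k = np.arange(1, terms + 1)
    q = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * (k * lam) ** 2))
    return float(min(max(q, 0.0), 1.0))

def fails_lognormal(sample, alpha=0.05):
    """True if the KS test separates the sample from its log-normal fit."""
    logs = np.log(sample)
    mu, sigma = logs.mean(), logs.std(ddof=1)
    d = ks_statistic(sample, lambda x: lognormal_cdf(x, mu, sigma))
    return ks_pvalue(d, len(sample)) < alpha

# Two synthetic stand-ins for per-matrix affinity samples (hypothetical data).
rng = np.random.default_rng(0)
lognormal_like = rng.lognormal(mean=-6.0, sigma=1.3, size=1000)
uniform_like = 1.0 - rng.random(1000)          # values in (0, 1]

failures = sum(fails_lognormal(s) for s in (lognormal_like, uniform_like))
```

Here `failures` plays the role of the “failure rate”: it counts the samples for which the fitted parametrization is rejected at the chosen threshold.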

First, we generated 1000 random sequences (_{KS}<0.05). This indicates that for most matrices the log-normal parametrization

Next we tested the same random data against a generalized extreme value (GEV) distribution with 3 parameters (

This parametrization is motivated by the fact that the total affinity

Indeed, the GEV distribution accounts for the bulk of the empirical data very accurately. For only 61/762 matrices is this parametrization incompatible with the actual distribution of 1000 simulated affinities (P_{KS}<0.05). Such failures can be exemplified by the case of E2F, for which we did not find any suitable parametrization, left plot of

The left figure shows the QQ-plot of the numerical data against the fit to a generalized extreme value distribution, Eq. 9. This example is for one matrix (GAL4_01), for which we calculated affinities for 1000 random background sequences of length _{KS} = 0.85. It should be contrasted to the best log-normal fit (right), which gives (μ = −6.77(4), σ = 1.35(3)), but this parametrized distribution is significantly different from the empirical data, _{KS} = 0.0003.

Model | Logn(μ, σ) | Gumbel | GEV | GPD |
Rand(GC = 0.5, L = 100) | 644 | 197 | 49 | 41 |
Rand(GC = 0.5, L = 1000) | 614 | 213 | 60 | 36 |
Rand(GC = 0.5, L = 2000) | 553 | 272 | 61 | 30 |
Rand(GC = 0.5, L = 10000) | 364 | 447 | 44 | 14 |
Rand(GC = 0.4, L = 2000) | 558 | 275 | 60 | 30 |
Rand(GC = 0.2–0.8, L = 2000) | 608 | 697 | 460 | 30 |
Genomic | 445 | 553 | 157 | 27 |
Promoters | 467 | 608 | 178 | 24 |

Here we summarize the performance of the parametrizations in terms of the number of TRANSFAC matrices (out of 762) for which the Kolmogorov-Smirnov test can discriminate between the empirical distribution and its parametrization. This Table illustrates that the GEV distribution with 3 parameters has far greater explanatory power (a lower “failure rate”) than simpler distributions with only two parameters (log-normal, Gumbel). If only the upper tails of the distribution are to be modeled, then the location parameter,

While the addition of one extra parameter seems like a small price to pay for the much better coverage, one should remember that all parameters depend on the length,

Here the first line follows from 〈

The coefficients (_{0},_{1}) need to be determined for each matrix separately. To this end we extended our numerical simulations to a range of different region sizes, _{1}≈0) and close to zero for most matrices. In the light of this observation we also considered fits to a Gumbel distribution with one less parameter (

Based on our analysis of Eq. (12), we were led to a regression analysis for the parameters of the generalized extreme value distribution against the logarithm of the region size, log _{0}, _{1}) = (−0.14, 0.04), black line with _{F}_{0}, _{1}) = (3.03, −0.64), red line with _{F}^{−10}, (_{0}, _{1}) = (−20.8, 4.46), blue line with _{F}^{−15}. As was mentioned in the main text, the shape

To summarize this section, we succeeded in deriving an efficient parametrization of the affinity distribution for sequences with fixed GC-content and for different region sizes,

The above results indicate that for a simple background model one can predict the exact distribution of affinities and, to a large extent, find a relatively simple parametrization in terms of a GEV distribution (Eq. 9). The more relevant question for possible biological applications is, of course, whether the same is possible for genomic DNA with variable GC-content. To answer this question we repeated the above analysis for three non-iid background models:

randomly generated sequences with variable GC-content,

genomic sequences (from human chromosome 1),

human promoter sequences of 2000 bp centered around the transcription start site.

The last choice was motivated by the specific application discussed in a subsequent section. From now on, we will always consider regions of fixed size

For the second background we considered a 2 Mbp region from human chromosome 1, which also has variable GC-content with an average of

Finally, we consider a background of 1000 human promoters (2000 bp centered around randomly chosen transcription start sites). Their average _{KS} = 0.05). This is much better than any other class of distributions we have tested. We provide a complete list of the parameters for 762 TRANSFAC matrices as supplementary material (

So far we focused on the derivation and parametrization of the whole distribution for small and large affinities. However, in practice we are hardly ever interested in affinities which are small in the context of some background model. Therefore the more crucial question is whether the tails can be modeled appropriately. To this end we invoke a theorem from Extreme Value Theory which states that under very general assumptions the tails of the distribution, above some threshold

Of course, the threshold

Now we show how to apply our results in a realistic setting. Consider, as an example, the promoter region of SRF, which we take to be a 2000 bp region centered at the transcription start site. The biologically relevant question is to decide which transcription factors are most likely to regulate the activity of SRF. We want to stress that our approach, as well as any other sequence-based approach, cannot answer this question in any fundamental way, as we only characterize the binding strength of a factor, but not its regulatory potential. For this more refined question one ultimately needs to invoke additional information, such as expression data

In the absence of such functional data, information on the binding strength may still guide biological investigations – an approach which is also taken for the analysis and interpretation of ChIP-chip data. However, a simple ranking of all known transcription factors according to their predicted binding affinity would not be very meaningful either. In our example, those factors with the highest calculated affinity are DFD, MINI20 and others, all of which have very high base affinities, but which are not specific to the promoter region of SRF.

In order to discriminate those factors which have high affinity specifically for the SRF promoter but not the background, we invoke the statistical approach and the background model defined above by the set of human promoters. In

Not normalized | | Normalized | |
Matrix-ID | Affinity | Matrix-ID | −log[P_{GEV}(Affinity)] |
I$DFD_01 | 51.53 | V$SRF_ | 3.02 |
V$MINI20_ | 10.07 | V$SRF_ | 2.89 |
V$MINI19_ | 9.40 | V$SRF_ | 2.77 |
I$ADF1_ | 8.18 | V$SRF_ | 2.72 |
V$AP2ALPHA_01 | 4.85 | F$CAT8_ | 2.48 |
V$ETF_ | 3.94 | V$SRF_ | 2.18 |
V$MUSCLE_INI_ | 3.60 | F$REB1_ | 1.95 |
F$FACBALL_ | 3.53 | V$OCT_ | 1.89 |

On the left side we give the top matrices, naively ranked according to their calculated affinity. On the right side we show the same number of top-ranking matrices, ranked according to the corresponding p-value from the GEV-parametrization.
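The difference between the two rankings can be illustrated with a small sketch. It assumes the textbook form of the GEV distribution with location `mu`, scale `sigma` and shape `xi`, which may differ in convention from the parametrization used in the text, and it assumes a base-10 logarithm for the normalized score. The factor names, affinities and parameter values are hypothetical and chosen only to show that a factor with a low raw affinity can rank first after normalization.

```python
import math

def gev_pvalue(x, mu, sigma, xi):
    """P(X >= x) for a GEV distribution (textbook parametrization, assumed)."""
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:                       # Gumbel limit of the GEV family
        t = math.exp(-z)
    else:
        base = 1.0 + xi * z
        if base <= 0.0:
            # x lies outside the support: below it for xi > 0, above it for xi < 0
            return 1.0 if xi > 0 else 0.0
        t = base ** (-1.0 / xi)
    return -math.expm1(-t)                    # 1 - exp(-t), stable for small t

# Hypothetical entries: factor -> (affinity, mu, sigma, xi) for one promoter.
factors = {
    "factor_A": (51.5, 40.0, 8.0, 0.1),       # high affinity on any sequence
    "factor_B": (4.0, 1.0, 0.8, 0.1),         # low affinity, but unusual here
    "factor_C": (9.4, 8.0, 2.0, 0.0),
}

raw_ranking = sorted(factors, key=lambda f: factors[f][0], reverse=True)

scores = {
    name: -math.log10(gev_pvalue(a, mu, sigma, xi))
    for name, (a, mu, sigma, xi) in factors.items()
}
normalized_ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting `raw_ranking` starts with `factor_A`, whereas `normalized_ranking` starts with `factor_B`, whose affinity is small in absolute terms but far out in the tail of its own background distribution.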

As a further example, we also considered the promoter region of another transcription factor, E2F, which is a known auto-regulator

E2F | | α1-antitrypsin | |
Matrix-ID | −log[P_{GEV}(Affinity)] | Matrix-ID | −log[P_{GEV}(Affinity)] |
V$E2F_03 | 2.87 | V$HNF1_01 | 2.97 |
F$REPCAR1_01 | 2.70 | V$HP1SITEFACTOR_ | 2.77 |
V$E2F1_ | 2.36 | V$AP1_ | 2.31 |
V$E2F_ | 2.35 | V$HNF1_ | 2.24 |
V$E2F_ | 2.31 | V$AP1_ | 2.24 |
V$E2F_01 | 2.29 | V$AP1_ | 2.18 |

As in

While the regulatory mechanisms for all those genes are likely to involve additional sequence elements and transcription factors, it is encouraging that some of the known key players can be detected by our method. We should also point out that we have assumed a scenario of maximal ignorance. Frequently one may already have a list of transcription factors among which to choose the one with the highest relative affinity. For example, in the case studies above, one may have excluded all non-vertebrate factors, or further restricted them to those which share an expression domain with the gene in question, if this information is known. Here this was not even necessary and it bodes well for the applicability of our framework to other promoters.

To assess this point more quantitatively, we retrieved from ENSEMBL (version 45

For each of 567 human promoters we determine the ranks of the transcription factors which are known to bind this promoter, based on our affinity ranking. The red histogram shows the distribution of these ranks. For example, 136 of 567 promoters have a known regulator assigned within the top 5 of 554 vertebrate TRANSFAC matrices. For 50 promoters a known regulator is also the top candidate within our ranking scheme. This should be compared to the red circles, which have been obtained from the same analysis, but with reshuffled promoter regions. The reshuffling was done 100 times to determine the standard deviation, which is shown as error bars. Finally, we also ranked the predictions from a traditional approach which assigns a number of binding sites to each promoter region. Notice that many transcription factors can have an identical number of binding sites, which leads to ambiguous ranking schemes. To be conservative, we always assign the best possible rank (histogram shaded in grey). Compared to the traditional approach, our method identifies a larger number of known regulatory interactions.
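The rank statistic behind this histogram can be sketched as follows. This is an illustration under stated assumptions, not our actual pipeline. Each promoter is represented by a dictionary of scores per factor and a set of known regulators, ties receive the best possible rank, and only the best-ranking known regulator is counted per promoter. The function name `best_rank` and all promoter and factor names are hypothetical.

```python
from collections import Counter

def best_rank(scores, known):
    """Best (smallest) rank among the known regulators of one promoter.

    A factor's rank is one plus the number of factors with a strictly
    higher score, so tied factors all receive the best possible rank.
    """
    ranks = [
        1 + sum(1 for s in scores.values() if s > scores[factor])
        for factor in known
        if factor in scores
    ]
    return min(ranks) if ranks else None

# Hypothetical toy data: promoter -> (score per factor, known regulators).
promoters = {
    "promoter_1": ({"TF_a": 3.0, "TF_b": 1.2, "TF_c": 0.4}, {"TF_a"}),
    "promoter_2": ({"TF_a": 0.5, "TF_b": 2.0, "TF_c": 2.0}, {"TF_c", "TF_a"}),
    "promoter_3": ({"TF_a": 1.0, "TF_b": 0.9, "TF_c": 0.1}, {"TF_c"}),
}

rank_histogram = Counter(
    best_rank(scores, known) for scores, known in promoters.values()
)
```

The same function applies to normalized affinities and to hit counts. With hit counts, the tie rule matters more because many factors share the same number of hits.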

Next we want to assess the significance of these findings. One might expect a shift towards higher ranks, simply because we always use the best matching factor, if more than one is known to regulate a certain gene. To account for this effect we reshuffled the associations between factors and promoters, while retaining the precise distribution of factors per promoter. Then we repeated our analysis on 100 such randomized sets and obtain 100 corresponding histograms. From this we determine the average histogram and the standard deviation, which are shown as red circles with error bars. While there is indeed a slight increase towards higher ranks, it is also clear from

Now we want to compare these results with what one would obtain from a traditional hit-based analysis. To this end we employ the annotation method which was introduced by Rahmann

In our earlier study _{0}). In the current context, we have repeated such an analysis also for the ranking of transcription factors. We find that the results from _{0}) is artificially increased by 20% (

In this paper we adopted a novel approach to the modelling of protein-DNA interactions. Rather than identifying transcription factor binding sites, we quantify the affinity of a transcription factor to any given sequence region. In contrast to the traditional approach, we do not seek a specific threshold, and we do not study “hits” of transcription factors. Instead, we are seeking an appropriate normalisation, which allows us to compare the affinities of different transcription factors directly with each other. This is similar in spirit to the different normalisation procedures which are currently applied to experimental ChIP-chip data

In our earlier work we were mostly concerned with the ranking of different sequences for a given factor, and we derived an optimal model to achieve just this. Here we addressed the more challenging task of comparing the affinities of different factors for a given sequence. This requires an understanding of the affinity distribution, which can be used to define a comparable score (a p-value). We have shown that, for a simple background model, the exact distribution of affinities can be predicted directly for any matrix. For a given sequence, all transcription factors with matrix descriptions can be ranked according to how strongly their affinity deviates from its expected value.

While the exact analysis can in principle be repeated for uniform sequences of all lengths, we also provide a relatively simple parametrization (GEV-model), which is applicable for more than 90% of all matrices, and in which the length dependence can also be accounted for through a regression analysis. Moreover, the GEV parametrization can also account for the distribution of affinities from sequences with variable GC-content, as long as the variability is not too strong. To demonstrate our approach in a realistic setting, we have applied this normalization to human promoters with known binding sites. We find that matrices of known transcription factors tend to rank highly according to their normalized affinity. This has been illustrated by the example of the SRF promoter, which yielded a clear suggestion for a known auto-regulatory loop, i.e. a strong relative binding of SRF to its own promoter. Remarkably, this link could be established without invoking any prior knowledge on the set of relevant transcription factors, and without sequence conservation.

We want to stress that even the best parametrizations used in our work leave room for improvement. While we have made an extensive effort to derive a simple characterization which is appropriate for most matrices, we clearly traded accuracy for efficiency (small number of parameters). More specifically, the GEV-model should not be used to estimate p-values very accurately. Instead, it represents an effective distribution which is appropriate at a certain level of granularity. As the distribution models in our study were derived from the empirical distribution of 1000 measured affinities, we do not expect accuracies better than ∼10^{−3}, even for those matrices for which we consider the GEV-model appropriate. Further improvements will likely come from a better description of the tails of the distribution, for which certain limit theorems ensure a universal behaviour, which may indeed be parametrised more accurately.

Alternatively, if one does not require a simple functional form, we have shown how to derive the exact affinity distribution using a characteristic function approach. In this context, further improvements will have to take into account higher order background models and positional dependencies. Here we have considered a zeroth order background model to derive the distribution _{ε}, and we assumed identically distributed affinities, _{l}

On the numerical side, it would be worthwhile to consider a better implementation of the Fourier transformation (and its inverse) over unevenly discretized domains. Our simple FFT implementation is straightforward, but it cannot accurately account for the region of small affinities, where the cumulative distribution _{i}_{i}

Notice that we took the matrices provided by TRANSFAC at face value, and did not pre-process them in any way. Clearly, many matrices are rather unspecific with low information content and correspondingly high baselevel affinities. In a more refined analysis one would probably want to remove them prior to the analysis, for example by invoking a quality measure as in

In summary, the combination of the physical binding model and the statistical normalization brings our theoretical predictions a step closer to the real world. To our knowledge, this is the first attempt to provide a quantitative measure which ranks transcription factors for a given sequence, and which can be compared directly with large-scale binding data. Its successful application to human promoter sequences serves as an encouraging example of how the method can be applied to other sequences.

Supplementary material.


Number of promoters with rank r.


We would like to thank Stefan Haas and Peter Arndt for valuable comments and suggestions.