
Conceived and designed the experiments: AR HL IS. Performed the experiments: HL. Analyzed the data: AR HL. Contributed reagents/materials/analysis tools: HL. Wrote the paper: HL. Other: Compiled and created the test set: AR. Supervised the study: IS.

The authors have declared that no competing interests exist.

An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict transcription factor (TF) binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
The test data set, a web tool, source code and supplementary data are available at:

Transcriptional regulation is a central control mechanism for many biological processes. Transcriptional regulation generally involves DNA-binding proteins, transcription factors (TFs), that control gene expression by binding to short regulatory sequence motifs in gene promoters

Computational approaches to TF binding site analysis can be divided into two categories,

TF binding

Although motif discovery methods are relatively well-developed, the TF binding prediction problem has attracted less attention. Most of the previous binding site prediction tools have been formulated as hypothesis testing methods, where a significance value of TF binding at a specific sequence position is obtained by comparing a test statistic to a null distribution

Here, we formulate a probabilistic framework for TF binding prediction that differs from the standard hypothesis testing approaches in three important ways. First, the proposed framework is probabilistic in nature and thus outputs a probability of binding (as opposed to a

To validate our computational methods, we constructed a test set of annotated binding sites in mouse promoters from existing databases

Because our proposed computational framework is based on probabilistic modeling, it provides an intuitive interpretation (i.e., probabilities, not

We construct our basic probabilistic framework using commonly used models for binding and non-binding sites, although generalizations to more complex models are straightforward. Consequently, our initial formulation is similar to other previously proposed TF binding prediction methods

Although applications that combine binding site prediction with other data sources, especially “phylogenetic footprinting,” are plentiful (see e.g

Although our general aim is to integrate as many lines of evidence as possible into TF binding prediction, we restrict our focus to those data sources that contain useful information for TF binding at the genome level. For example, we do not consider functional data sources, such as gene expression or protein level measurements, or descriptive higher-level information, such as gene ontology. Gene or protein expression measurements alone can be informative of transcriptional regulatory relationships and, therefore, indirectly informative of TF binding as well, but we do not include those data sources in our modeling framework. The reason is that proper modeling of gene expression or protein level measurements inevitably requires the use of a predictive network model, such as a (dynamic) Bayesian network

To motivate the results presented in this section, we briefly outline our computational methods—first for the basic probabilistic formulation without any additional data sources and then for general data fusion. A detailed description of the methods is presented in ‘

The most commonly used probabilistic models for binding sites and background sequences, on which our methods are also built, are the position specific frequency matrix (PSFM) model

Motif and background models are denoted by θ and φ, respectively. For the cases where a TF is associated with more than one motif model, we denote multiple motifs by Θ = (θ^{(1)},…,θ^{(m)}). A key (unknown) quantity is the number of binding sites _{1},…,_{N}_{1},…,_{c}^{c}

The diagram illustrates the upstream promoter region for a gene, where the direction of transcription is indicated by the direction of the arrows. The arrows are located at the transcription start sites.

Given

Because motif models are typically constructed from a relatively small number of experimentally verified binding sites (or from an output of a motif discovery algorithm) they can contain a considerable amount of uncertainty. Thus, it is also useful to consider a Bayesian approach where Θ and φ are random variables. Given

We demonstrate our computational methods on a carefully constructed test set that contains annotated binding sites in 47 mouse promoters. For the positive cases we use those TF-promoter pairs that contain an annotated binding site for a TF in a promoter. Negative cases consist of TF-promoter pairs that (are expected to) have no functional binding sites. Although negative sets in general are likely to contain some true binding sites (those that are not yet discovered), that does not invalidate our performance evaluation but merely introduces a pessimistic bias in our results. Similarly, some of the annotated binding sites may have been used to construct the PSFM models that we use in our analysis, which in turn can introduce an optimistic bias into our results. The above-mentioned biases are impossible to avoid in practice. Fortunately, this is not a major issue as long as we use the same test set and the same PSFM and Markovian background models for all the new and previously published methods that we compare. The full details of the test set can be found in the ‘Data’ Section. For the binding specificities, we use (scaled) motif models from TRANSFAC Professional version 10.3. We measure performance using standard receiver operating characteristic (ROC) curves that plot the fraction of true positives (sensitivity) versus the fraction of false positives (complementary specificity), see e.g
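
As an illustration of the performance measure, the following is a minimal sketch of how an ROC curve and its area (AUC) can be computed from predicted binding probabilities and positive/negative labels. The function names `roc_curve` and `auc` are hypothetical and not part of the published software; the sketch assumes that both positive and negative cases are present.

```python
from typing import List, Tuple


def roc_curve(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    scores: predicted binding probabilities (higher means more likely bound).
    labels: 1 for an annotated (positive) TF-promoter pair, 0 for a negative one.
    Assumes at least one positive and one negative case.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sort by decreasing score so that lowering the threshold admits cases in order.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        s = ranked[i][0]
        # Cases with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == s:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

For example, `auc(roc_curve([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))` evaluates the ranking of two positive and two negative pairs.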

Before proceeding to more interesting results, we first test the effect of some of the parameters in our probabilistic formulation. A natural parameter to start with is the order of the Markovian background model. Several authors have reported that the use of higher-order background models improves motif discovery

Another useful preliminary test is to vary the prior probability of having

Adopting a Bayesian approach allows the modeling of uncertainty in the model parameters as well. Although we use the Dirichlet prior distribution for both the motif and background model parameters, we are primarily concerned with the uncertainty in the motif models, since the background model parameters are estimated from a much larger data set. Hyperparameters of the Dirichlet prior control the amount of uncertainty in the motif model parameters and that can also provide regularization for the inference problem (see


(a) Likelihood

In order to better assess the performance of the proposed methods, we also compare our basic likelihood method with traditional promoter scanning (see

The background model order is (a)

Many TF binding sequences are relatively short and non-unique and hence, the expected number of their occurrences in a genome by chance is high. These presumably non-functional binding sites cause traditional TF binding site prediction methods to have unacceptably high false positive rates. Although the above probabilistic formulation provides a principled framework for TF binding inference and allows regularization via a Bayesian approach, being built on the same modeling framework as other methods (PSFM motif and Markovian background models), it is also vulnerable to identifying non-functional sites.

A natural way to improve specificity of TF binding site predictions is to make use of additional biological information in the inference. This idea has been proposed in several articles where, for example, promoter scanning with PSFMs has been constrained to only those parts of the genome that are highly conserved, i.e., conservation scores exceed a threshold (see reviews in

A number of additional information sources can be useful for predicting TF binding. First, functional binding sites are typically evolutionarily conserved

Second, in addition to evolutionary conservation, methods exist to assess whether a conserved sequence is neutral or functional. This more detailed information, often called regulatory potential, has a potential to distinguish neutral sequence regions from the functional ones, even within conserved parts of sequences. Regulatory potential scores (log-likelihoods) are obtained using ESPERR

Third, while evolutionary conservation can help in discriminating functional binding sites that are more prevalently located on conserved parts of the genome from presumably non-functional sites on non-conserved regions, it does not explain the mechanism by which a TF is guided to its functional site. A hypothesis is that this process is controlled by the intrinsic nucleosome organization of genomes

Although we primarily focus on the three additional evidence sources mentioned above, other information sources can also be directly included into our modeling framework. For example, a general sequence feature of many promoters, and thereby a feature of binding sites within promoters as well, is that they typically have a high CpG dinucleotide content

From a computational point of view, we assume that each additional data source is in the form

The intuitive rationale for defining the term that captures the additional data,

We extend the above framework to multiple data sources _{i}_{D}

The data fusion problem is illustrated in ^{(i)} to the background model φ, ^{(i)} are highly correlated whereas for other TFs, such as SP1, motif models produce “scores” which are distinct from each other. Finally, many of the annotated sites are also associated with a high probability of conservation and regulatory potential and with a low probability of nucleosome occupancy. This correlation is not expected to be perfect though since only about 50% of the functional binding sites are assessed to be conserved (see

(a) Annotated binding sites for SRF on Actc1 promoter. (b) Annotated binding site for SRF on M23768 promoter. (c) Annotated binding site for SP1 on Myod1 promoter. (d) Annotated binding site for TEAD1 on Myh6 promoter. Figure keys are as follows. θ^{(i)}: motif models for each TF, Conserv.: sequence conservation probabilities computed by PhastCons

Histograms of the estimated binding probabilities for the likelihood-based method when combined with (b) regulatory potential and (c) evolutionary conservation.

In these simulations, the scaling parameter for each additional data source (see ‘Combining multiple information sources’ Section for details) is chosen using a grid search over values

Given the above promising results with a single additional data source, a natural question is whether combinations of additional data sources further improve TF binding prediction. We consider the combination of conservation and regulatory potential for which the ROC curve as well as the corresponding histogram of the estimated binding probabilities are shown in

(b) Histogram of the estimated binding probabilities for a combination of conservation and regulatory potential.

We used the same scaling parameters as above and tried a set of different weighting schemes and again chose the weighting parameter that gives the best AUC measure over the grid w_{2} ∈ {0.5, 0.52,…, 1}. We found w_{1} = 0.14 and w_{2} = 1 − w_{1} = 0.86 for regulatory potential and conservation, respectively (see ‘Combining multiple information sources’ Section for details). Similarly to scaling, weighting is not sensitive to small deviations in the values of the weighting parameters. Weighting some of the data sources more heavily just biases the results towards those obtained using the particular single data source alone.

The standard practice has been to constrain the scanning with PSFMs to only those regions of the genome whose conservation probability (or score) is sufficiently high

Evolutionary conservation and regulatory potential are the most informative additional data sources in our simulations, whereas estimated nucleosome locations or CpG-islands do not improve TF binding predictions. As mentioned above, the particular estimated nucleosome data that we use might not be optimal for our mouse test set. Once a mouse nucleosome model or high-throughput nucleosome data become available, they can be used in the same framework as well, likely improving TF binding predictions. For example, Narlikar

Gene regulation in higher organisms commonly requires multiple TFs. Thus, combinatorial regulation by several TFs is another important problem to study. The main difference between a single TF and multiple TFs regulating a gene is that combinatorial regulation requires all TFs to have at least one binding site for (at least) one of their motif models. Although multiple regulatory proteins can also form a complex and the complex can regulate a target gene via a single binding site, we only consider regulation via multiple binding sites, since the single binding site case is similar to our previous analysis. Statistical inference for combinatorial regulation can be naturally addressed in our probabilistic framework. For that purpose, we propose to use both the likelihood and Bayesian methods (see ‘Combinatorial regulation’ Section for more computational details).

Combinatorial regulation by multiple TFs is less well-known and fewer combinatorial annotated binding sites are reported in databases or even in the literature (see

The problem of inferring combinatorial regulation among many TFs becomes computationally expensive because there are

Histogram of combinatorial regulation probabilities for (b) the Bayesian method and (c) naive likelihood approximation.

As suggested by previous simulations, the detection of combinatorial regulation can be improved by incorporating additional data sources. We consider using evolutionary conservation for which the results are shown in

Histogram of combinatorial regulation probabilities for (b) the Bayesian method with evolutionary conservation and (c) a naive likelihood approximation with evolutionary conservation.

So far we have assumed that the direction of transcription is known and we have focused on analyzing only a single strand of the DNA. This is not always the case and, therefore, it is useful to generalize TF binding prediction methods such that they use both strands of the DNA. Our probabilistic methods generalize naturally to handle double-stranded DNA. This can be achieved simply by applying the aforementioned methods to both strands, either independently or simultaneously (computational details are described in ‘Single vs. both strands’ Section). To demonstrate performance of our methods on double-stranded DNA, we re-compute the results shown in

Our final simulation concerns inferring the probability of having a binding site at a single nucleotide position. Although our proposed methods are primarily designed to process each promoter sequence as a whole, it is also useful to be able to infer binding probabilities at a higher resolution, in particular, at each nucleotide position. Inferring the binding probabilities at each base pair location is more challenging from the computational point of view. In particular, the efficient recursive algorithm developed for the likelihood method cannot be applied. The inference can, however, be easily performed in our Bayesian framework using our MCMC sampler. The binding probability at each base pair location is obtained by integrating out all other locations (see ‘Binding probabilities at single nucleotide resolution’ Section for more details).
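
With posterior samples available, integrating out all other locations amounts to counting how often a site starts at a given position. The sketch below assumes that each MCMC sample is represented simply as a collection of binding site start positions; the function name `start_position_probabilities` and this representation are illustrative assumptions, not the interface of the published code.

```python
from typing import Iterable, List


def start_position_probabilities(samples: List[Iterable[int]],
                                 n_positions: int) -> List[float]:
    """Estimate P(a binding site starts at position i | data) for each i.

    samples: one entry per retained MCMC iteration, each an iterable of
             0-based binding site start positions (possibly empty).
    Averaging the indicator "a site starts at i" over the samples
    marginalizes over all other site locations.
    """
    counts = [0] * n_positions
    for sites in samples:
        for a in set(sites):  # a start position counts at most once per sample
            counts[a] += 1
    return [c / len(samples) for c in counts]
```

The same averaging can be applied to the indicator "position i is covered by a site" if coverage rather than start positions is of interest.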

Subplots (b), (d), (f) and (h) show the same results but with evolutionary conservation as the additional data source. The blue and red graphs indicate the start of the binding sites. The annotated binding sites are shown with gray vertical bars. These results correspond to

Encouraged by the above performance evaluations, we applied the proposed likelihood-based binding prediction method to the 2K base pair upstream promoter regions of all 20397 mouse genes, where the genomic locations of the promoters are based on RefSeq gene annotations. Evolutionary conservation was used as an additional data source as explained above and binding specificities for 266 TFs were again taken from TRANSFAC Professional version 10.3. Prior to analyzing the promoter sequences, DNA repeats were found using RepeatMasker

Histogram frequency at bin value 10 in Figure (b) (resp. value about 5 in Figure (c)) includes all values that exceed 10 (resp. 5).

In our analysis, we primarily focused on estimating the binding probability of a TF to either the whole analyzed promoter or at single base pair resolution (using the Bayesian method). We also introduced an extension for inferring combinatorial regulation. Given the flexibility of our probabilistic (Bayesian) modeling framework, virtually any question can be answered probabilistically within it. For example, Beer and Tavazoie

One popular way of analyzing expression data is based on clustering similarly behaving genes together or finding groups of genes that are differentially expressed. The gene sets found are then typically searched for common (either known or unknown) sequence motifs. A potentially very useful extension of our framework will be to develop a method for computing the probability that a set of genes (or a fraction of them) have a binding site for a TF or for a set of TFs.

A number of other possible extensions are also easily included in this probabilistic modeling framework. For example, some proteins interact to form a heterodimer and bind as a complex, in which case the potential binding sites of (all or a subset of) the constituent TFs may be more likely to be physically close to each other. Incorporation of protein-protein interaction databases may help in revealing such mechanisms. Evolutionary conservation was included in the framework by utilizing the conservation scores of an input promoter. One can also simultaneously analyze the corresponding promoter in other organisms to check if they have a binding site for the same TF (in the corresponding location), see

Our final note is devoted to the general distinction between motif discovery and binding site prediction methods. The proposed Bayesian method interprets binding specificities as random variables whose prior parameters define the amount of uncertainty associated with each TF binding model. By gradually forcing all pseudo counts to be equal (unity), i.e., increasing the uncertainty, the Bayesian binding prediction method indeed turns into a pure motif discovery method. Thus, for uncertain binding specificities, the Bayesian method can also be used as a motif discoverer.

Our future work includes developing the framework in the direction of the aforementioned extensions. We are also extending our genome-wide analysis to yeast. Predicting TF binding in yeast is interesting not only because it is the most often considered model organism, but also because yeast has a well-developed nucleosome model

A central goal in the described computational analysis is accurate TF binding prediction from multiple data sources. However, because TF binding does not necessarily imply transcriptional regulation, it is also important to further extend computational methods to incorporate other, functional data, such as gene expression or protein level (time series) measurements. Statistical inference of transcriptional regulatory networks from a combination of gene expression time series, promoter sequence and binding specificity data has been studied in

We have developed a flexible and comprehensive framework for TF binding prediction from multiple data sources. The proposed methods are probabilistic in nature and, thus, directly assess our degree of belief in binding or non-binding in terms of probabilities. Instead of assessing TF binding at each nucleotide location separately, we extended the binding prediction methods to analyze each promoter sequence as a whole. This gives a more complete view of a TF binding to and possibly regulating a target gene. Although we primarily focused on answering the question of whether the entire promoter has a binding site for a TF, we also developed a method for computing binding probabilities at each nucleotide position by essentially integrating out other locations in a promoter. Most importantly, the proposed methods can make principled inference from multiple data sources that can include, among others, multiple motif models, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites and ChIP-chip. Results on our carefully constructed test set demonstrate that principled data fusion can significantly improve the performance of binding prediction methods. Recent technological developments, such as protein binding microarrays, now allow accurate measurements of TF binding specificities to be gathered in a high-throughput fashion. Using accurate binding specificity measurements together with principled TF binding prediction methods can provide a competitive alternative to traditional condition-specific ChIP-chip experiments, especially when TF binding prediction incorporates multiple additional data sources. Our genome-wide TF-DNA binding results for mouse indicate relatively sparse connectivity between TFs and their target genes, consistent with previous results. The probabilistic formulation of TF binding prediction is particularly useful for integrating our results as building blocks in other computational methods.
To that end, we have also implemented a web tool, ProbTF, which allows users to analyze their own promoter sequences and additional data sources.

The computational methods are implemented in Matlab and will be made available as an open-source library upon publication. The test set, including all sequences and additional information sources, will also be made available on a supplementary web site. A preliminary version of the computational methods presented in Sections ‘Modeling framework’, ‘Likelihood approach: one motif model θ’, and ‘Likelihood approach: multiple motif models Θ’ has been reported in our previous conference article

Let _{1},…,_{L}_{i}_{1},…,_{c}

Non-binding background sequence locations are modeled by the commonly used _{i}_{−d+1},…,_{0}. We could alternatively define a separate probability distribution for the first

Motifs are modeled using the standard PSFM model θ which is a product of independent multinomial distributions _{i}
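
To make the two sequence models concrete, the following sketch evaluates a candidate site under a PSFM (a product of independent per-position multinomials) and under an order-d Markov background, and forms their log-likelihood ratio. The data layout (a list of per-position dictionaries for θ, a dictionary keyed by context string for φ) and the uniform treatment of the first d nucleotides are assumptions made for illustration only.

```python
import math
from typing import Dict, List


def psfm_log_prob(site: str, psfm: List[Dict[str, float]]) -> float:
    """log P(site | theta): product of independent per-position multinomials."""
    assert len(site) == len(psfm)
    return sum(math.log(column[x]) for x, column in zip(site, psfm))


def markov_log_prob(seq: str, start: int, length: int,
                    phi: Dict[str, Dict[str, float]], d: int) -> float:
    """log probability of seq[start:start+length] under an order-d Markov model.

    phi maps each context (the d preceding nucleotides; "" when d = 0) to a
    distribution over the next nucleotide.
    Assumption: positions with fewer than d predecessors are scored uniformly.
    """
    total = 0.0
    for i in range(start, start + length):
        if i < d:
            total += math.log(0.25)
        else:
            total += math.log(phi[seq[i - d:i]][seq[i]])
    return total


def log_likelihood_ratio(seq: str, start: int, psfm: List[Dict[str, float]],
                         phi: Dict[str, Dict[str, float]], d: int) -> float:
    """Log ratio of motif versus background probability for a site at `start`."""
    site = seq[start:start + len(psfm)]
    return psfm_log_prob(site, psfm) - markov_log_prob(seq, start, len(psfm), phi, d)
```

Ratios of this kind, one per candidate start position, are the basic per-site quantities from which promoter-level binding probabilities can be assembled.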

Using Bayes' rule, the probability of

As in _{1} =

The probability that a given TF (defined by θ) binds to a gene having promoter sequence
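
A promoter-level binding probability can be computed by summing over all placements of non-overlapping sites with a simple recursion. The sketch below is one way to do this for a single motif model and is not necessarily the recursion used in our implementation. It assumes (i) per-position likelihood ratios between the motif and background models are given, (ii) a prior over the number of sites is given, and (iii) given the number of sites, all non-overlapping placements are equally likely a priori. The function name `binding_probability` is hypothetical.

```python
from typing import List, Sequence


def binding_probability(ratios: Sequence[float], motif_len: int,
                        prior_counts: Sequence[float]) -> float:
    """P(at least one binding site | promoter sequence).

    ratios[a]: motif/background likelihood ratio for a site starting at a
               (one entry per valid start, so the sequence length is
               len(ratios) + motif_len - 1).
    prior_counts[c]: prior probability of exactly c sites, c = 0..C.
    """
    L = len(ratios) + motif_len - 1
    C = len(prior_counts) - 1
    # w[i][c]: sum, over placements of c sites within the first i nucleotides,
    #          of the product of their ratios.  n[i][c]: number of placements.
    w: List[List[float]] = [[0.0] * (C + 1) for _ in range(L + 1)]
    n: List[List[float]] = [[0.0] * (C + 1) for _ in range(L + 1)]
    for i in range(L + 1):
        w[i][0] = n[i][0] = 1.0
    for i in range(1, L + 1):
        for c in range(1, C + 1):
            # Either nucleotide i-1 is background ...
            w[i][c] = w[i - 1][c]
            n[i][c] = n[i - 1][c]
            # ... or it is the last nucleotide of a site starting at a.
            if i >= motif_len:
                a = i - motif_len
                w[i][c] += ratios[a] * w[a][c - 1]
                n[i][c] += n[a][c - 1]
    # Unnormalized posterior over the number of sites.
    post = [prior_counts[c] * w[L][c] / n[L][c] if n[L][c] > 0 else 0.0
            for c in range(C + 1)]
    return 1.0 - post[0] / sum(post)
```

The cost is proportional to the promoter length times the maximum number of sites considered, which is what makes whole-promoter inference practical in the likelihood setting.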

A TF can recognize several different types of binding sites and is then characterized by several motif models Θ = (θ^{(1)},…,θ^{(m)}) each having length ℓ_{i}^{c}_{i}_{i}

The probability of _{min} = {ℓ_{1},…,ℓ_{m}

The prior ^{(1)},…, θ^{(m)}) are independent. Indeed, it is likely that they are strongly dependent. Therefore, we use the same prior

Let Θ→

The above probabilistic modeling framework that incorporates multiple motif models can be viewed as an extension of a framework proposed in

TF binding specificities are derived from experimental data sets, some of which have extremely small sample sizes (as low as five reported binding sequences). PSFM models can therefore contain a considerable amount of uncertainty. Instead of assuming motif models θ to be known exactly, as above, it is useful to take the uncertainty in the motif models themselves into account. This can be done naturally in a Bayesian setting where the parameters/models are considered as random variables. We describe the Bayesian methods directly for the case of multiple motifs. The single motif case can be obtained as a special case by setting m = 1.

Using Bayes' rule, the probability of motif positions

The marginal likelihood

TF-DNA binding databases typically provide information in the form of “the number of times a TF has been observed to bind a given sequence.” These sequences are also aligned and aligned counts are summarized in position specific weight matrices (TRANSFAC, JASPAR), which we denote as α_{ij}^{(k)}. Similar counts can also be obtained for the background model (denoted by α_{ij}^{(0)}) from genomic sequences that do not contain (known) binding sites. Therefore, it is natural to use a Dirichlet prior for the parameters, which is defined by so-called pseudo-counts.

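Let us rewrite the motif model parameters (independent multinomial distributions) for now as θ^{(k)}(i, j) = θ_{ij}^{(k)}, which again defines the probability of seeing nucleotide i at the jth position (j = 1,…,ℓ_{k}) of the kth motif model, and let θ_{j}^{(k)} = {θ_{ij}^{(k)} | i ∈ {A, C, G, T}}. The Dirichlet prior for θ_{j}^{(k)} with hyperparameters α_{ij}^{(k)} is defined as P(θ_{j}^{(k)} | α_{j}^{(k)}) = [Γ(∑_{i} α_{ij}^{(k)}) / ∏_{i} Γ(α_{ij}^{(k)})] ∏_{i} (θ_{ij}^{(k)})^{α_{ij}^{(k)} − 1}, where θ_{ij}^{(k)} ≥ 0, ∑_{i} θ_{ij}^{(k)} = 1, α_{ij}^{(k)} > 0, and Γ(·) is the Gamma function. Priors for different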

The Dirichlet prior is also a conjugate prior for multinomials. Consequently, the marginal likelihood has a closed-form solution. Let N_{ij}^{(k)} denote the number of times nucleotide i is observed at the jth position of the kth model, and let α_{j}^{(k)} = ∑_{i} α_{ij}^{(k)} and N_{j}^{(k)} = ∑_{i} N_{ij}^{(k)}. The marginal likelihood can be written as
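
For reference, the standard closed form for a single Dirichlet-multinomial column is Γ(α_{j}) / Γ(α_{j} + N_{j}) × ∏_{i} Γ(α_{ij} + N_{ij}) / Γ(α_{ij}), and independent columns multiply. The sketch below evaluates this in log space; the function names are hypothetical, and the way counts are gathered from a particular site configuration is left to the caller.

```python
import math
from typing import Sequence


def log_marginal_column(counts: Sequence[float], alpha: Sequence[float]) -> float:
    """Log marginal likelihood of one column under a Dirichlet prior.

    counts[i]: number of times nucleotide i is observed in this column.
    alpha[i]:  Dirichlet pseudo-count for nucleotide i (must be > 0).
    Computes log[ G(a.) / G(a. + N.) * prod_i G(a_i + N_i) / G(a_i) ],
    where G is the Gamma function and a dot denotes a sum over i.
    """
    a_sum = sum(alpha)
    n_sum = sum(counts)
    out = math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
    for n_i, a_i in zip(counts, alpha):
        out += math.lgamma(a_i + n_i) - math.lgamma(a_i)
    return out


def log_marginal_model(count_matrix: Sequence[Sequence[float]],
                       alpha_matrix: Sequence[Sequence[float]]) -> float:
    """Sum of column-wise log marginal likelihoods (columns are independent)."""
    return sum(log_marginal_column(n, a)
               for n, a in zip(count_matrix, alpha_matrix))
```

Working with `math.lgamma` rather than the Gamma function itself avoids overflow when pseudo-counts or observed counts are large.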

To keep likelihood-based and Bayesian approaches comparable, we use the same prior here, i.e._{i}

Because some of the motif models are remarkably diffuse (computed from only a few example sequences), we do not use the PSWMs as pseudo counts directly. We instead use a version that incorporates a so-called prior strength term

This prevents a single sequence from having too strong an influence on the posterior parameter values. For simplicity, we use the same prior strength for all the motif models, although this does not need to be the case in general. In addition, we add a small number (one) to each α_{ij}^{(k)} to prevent zero entries. Finally, to preserve comparability between the likelihood and Bayesian approaches, the normalized motif models for the likelihood-based approach are computed from the recomputed pseudo-counts used in the Bayesian estimation, i.e.,

Recall that in the Bayesian framework, the mean of θ_{ij}^{(k)} under the Dirichlet prior P(θ_{j}^{(k)}|α) is equal to the quantity in Equation (18).
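
The pseudo-count preparation described above can be sketched as follows. The exact form of the prior strength term is not reproduced here, so the sketch assumes that each column of the count matrix is rescaled to sum to the prior strength before one is added to every entry; the function names and this rescaling rule are illustrative assumptions.

```python
from typing import List, Sequence


def scaled_pseudocounts(count_matrix: Sequence[Sequence[float]],
                        prior_strength: float) -> List[List[float]]:
    """Turn a raw count matrix into Dirichlet pseudo-counts.

    Assumption: each column (with a positive total) is rescaled so that it
    sums to `prior_strength`; then one is added to each entry to avoid zeros.
    """
    alphas = []
    for column in count_matrix:
        total = sum(column)
        scaled = [prior_strength * c / total for c in column]
        alphas.append([a + 1.0 for a in scaled])
    return alphas


def normalized_motif(alphas: Sequence[Sequence[float]]) -> List[List[float]]:
    """Column-normalized pseudo-counts, i.e. the Dirichlet mean of each column."""
    return [[a / sum(column) for a in column] for column in alphas]
```

Using the Dirichlet mean as the point estimate keeps the likelihood-based and Bayesian methods centered on the same motif model.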

Unfortunately, there is no efficient recursive formula to compute the probabilities

The MH algorithm is completely specified by a proposal distribution

Motif addition with probability ^{(i)}∈Θ, propose a new, non-occupied/non-overlapping motif position uniformly randomly (if a free location exists).

Motif deletion with probability 1−

We use ^{(B+1)}, π^{(B+1)}), (^{(B+2)}, π^{(B+2)}), …, (^{(B+N)}, π^{(B+N)})) is collected.
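
The addition and deletion moves can be illustrated with a small Metropolis-Hastings sampler. The sketch below is a simplified stand-in, not our implementation: it handles a single motif model with fixed per-position likelihood ratios (so motif parameters are not integrated out), assumes a prior over the number of sites with uniform placement given that number, and uses hypothetical names such as `mh_binding_probability` and `p_add`.

```python
import math
import random
from typing import List, Sequence, Set


def free_starts(sites: Set[int], n_starts: int, motif_len: int) -> List[int]:
    """Start positions where a new site would not overlap an existing one."""
    return [a for a in range(n_starts)
            if all(abs(a - b) >= motif_len for b in sites)]


def mh_binding_probability(ratios: Sequence[float], motif_len: int,
                           prior_counts: Sequence[float], n_burn: int,
                           n_samples: int, p_add: float = 0.5,
                           seed: int = 0) -> float:
    """Estimate P(at least one site | sequence) by add/delete MH moves.

    Assumes prior_counts[0] > 0, since the chain starts with no sites.
    """
    rng = random.Random(seed)
    n_starts = len(ratios)
    L = n_starts + motif_len - 1

    def g(c: int) -> float:
        # Prior mass per placement: P(c sites) / number of placements of c sites.
        if c >= len(prior_counts):
            return 0.0
        return prior_counts[c] / math.comb(L - c * motif_len + c, c)

    sites: Set[int] = set()
    hits = 0
    for t in range(n_burn + n_samples):
        c = len(sites)
        if rng.random() < p_add:
            free = free_starts(sites, n_starts, motif_len)
            if free:  # addition is only proposed if a free location exists
                a = rng.choice(free)
                target = g(c + 1) * ratios[a] / g(c)
                # Reverse move deletes one of c+1 sites; forward chose one of `free`.
                hastings = ((1.0 - p_add) / (c + 1)) / (p_add / len(free))
                if rng.random() < target * hastings:
                    sites.add(a)
        elif sites:
            a = rng.choice(sorted(sites))
            rest = sites - {a}
            free_after = free_starts(rest, n_starts, motif_len)
            target = g(c - 1) / (g(c) * ratios[a])
            # Reverse move re-adds `a`, chosen among the free starts of `rest`.
            hastings = (p_add / len(free_after)) / ((1.0 - p_add) / c)
            if rng.random() < target * hastings:
                sites = rest
        if t >= n_burn and sites:
            hits += 1
    return hits / n_samples
```

The Hastings correction accounts for the fact that additions and deletions are proposed from sets of different sizes; without it the chain would not target the intended posterior.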

Although the chain is ergodic as shown above, it is important to monitor convergence of the MCMC algorithm for finite samples to guarantee the desired output. Bayesian inference in this case can be considered as a model selection problem where the model space consists of all valid pairs (_{|A| = c} ∑_{π} _{1}-distance between two independent estimates of ^{5} for the burn-in and

As above, the quantities of interest include the probability of having at least one binding site for at least one of the motif models, denoted as Θ→

From the point of view of modeling transcriptional regulatory networks, it is also important to study combinatorial regulation by several TFs Ω = (Θ^{(1)},…, Θ^{(p)}), each with a set of motif models

In other words, the set

Probabilities _{1}+…+_{p}_{i}_{Ω}, as

There is no efficient formula to compute ^{(i,j)} is now added uniformly randomly from a list of motif sets Ω. Finally,

We also consider a naive (likelihood-based) approximation that estimates the probability of combinatorial regulation by the product of individual binding probabilities, i.e.,
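
In code, the naive approximation is a one-line product. The sketch below assumes the individual binding probabilities have already been computed for each TF; the function name is hypothetical.

```python
from math import prod
from typing import Sequence


def naive_combinatorial_probability(single_tf_probs: Sequence[float]) -> float:
    """Approximate P(all TFs bind) by the product of per-TF binding probabilities.

    This treats the TFs as binding independently of each other.
    """
    return prod(single_tf_probs)
```

Because each TF is scored separately, the approximation is cheap but cannot capture interactions between the binding sites of different TFs.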

TF binding predictions can be significantly improved by incorporating multiple additional data sources, such as evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites or ChIP-chip, into our probabilistic inference framework. Let

We model the probability of

The factorization in Equation (22) is useful as it allows us to write

In particular, note that in the likelihood based approach the same efficient recursive formula as in ‘Likelihood approach: multiple motif models Θ’ Section can be applied to compute

In a Bayesian setting, data fusion can be performed similarly

If we have several additional data sources _{k}_{k}_{k}_{D}_{k}

Conceptually, the probability

In practice, we do not use the above probabilities directly but a scaled version of them. For the _{k}_{k}_{k}_{k}_{k}_{k}_{k}_{k}_{k}

So far we have only considered using a single, either forward or reverse, strand for inferring TF binding. Extending the above methods to double-stranded DNA is straightforward. Let a promoter be denoted now as _{S}_{′} and _{S}_{″}. Now

Due to the OR-type of event (Θ→_{S}_{′} and _{S}_{″}. A similar extension works for the Bayesian case as well with the exception that
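
For the case where the two strands are treated independently, the sketch below forms the reverse complement strand and combines the two per-strand binding probabilities as an OR-type event. The independence assumption and the function names are illustrative; the simultaneous treatment of both strands is not shown.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def binding_probability_both_strands(p_forward: float, p_reverse: float) -> float:
    """P(site on at least one strand), assuming the strands are independent."""
    return 1.0 - (1.0 - p_forward) * (1.0 - p_reverse)
```

A per-strand method would be run once on the promoter and once on `reverse_complement(promoter)`, and the two results combined with the second function.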

Our computational methods are primarily designed to answer the question of whether the whole promoter has a binding site for a given TF. However, it is also useful to be able to infer binding probabilities at higher, single nucleotide, resolution. A motif position and configuration pair (

We compare our proposed probabilistic method with traditional promoter scanning
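
For comparison, a common form of promoter scanning scores every window by its PSFM log-odds against the background and summarizes the promoter by the best window. The sketch below uses an order-0 background for brevity and may differ in detail from the scanning statistic used in our comparison; the function names are hypothetical.

```python
import math
from typing import Dict, List


def scan_scores(seq: str, psfm: List[Dict[str, float]],
                background: Dict[str, float]) -> List[float]:
    """Log-odds score of the motif versus an order-0 background at each start."""
    width = len(psfm)
    return [
        sum(math.log(psfm[j][seq[a + j]] / background[seq[a + j]])
            for j in range(width))
        for a in range(len(seq) - width + 1)
    ]


def promoter_score(seq: str, psfm: List[Dict[str, float]],
                   background: Dict[str, float]) -> float:
    """Summarize a promoter by its best-scoring window."""
    return max(scan_scores(seq, psfm, background))
```

Unlike the probabilistic method, this statistic depends on a single best window and does not by itself yield a probability of binding for the whole promoter.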

The data set consists of a merger of annotated TF binding sites for mouse from the ABS

The promoters are generally upstream of the genes that they are associated with. However, some regions stretch over the first exon into intronic regions. Therefore, some promoters have exons in their sequence. There were around a dozen overlapping promoters between the two databases. The TF binding sites for these promoters were merged onto the longer 2K ORegAnno sequences. The data set also includes 250 upstream, non-coding sequences that can be used for generating background models and statistics.

The test set used in our simulations consists of 47 promoter sequences, each having a varying number of annotated binding sites (

For all the simulations, we use (scaled) motif models from TRANSFAC. Parameters of the Markovian background models (model orders 0,1,…, 4 are tested) are estimated from the 250 negative sequences (both strands).
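
Estimating an order-d Markov background model from a set of sequences, using both strands, can be sketched as below. The pseudo-count added to every transition and the skipping of non-ACGT symbols (for example masked repeats) are assumptions made for illustration, and the function names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def estimate_markov_background(seqs: Iterable[str], d: int,
                               pseudocount: float = 1.0
                               ) -> Dict[str, Dict[str, float]]:
    """Estimate P(next nucleotide | d preceding nucleotides) from both strands.

    Returns a dictionary keyed by context string ("" when d = 0).
    Assumption: `pseudocount` > 0 is added to every transition so that
    unseen contexts still yield a valid distribution.
    """
    counts = {"".join(ctx): {x: pseudocount for x in "ACGT"}
              for ctx in product("ACGT", repeat=d)}
    for s in seqs:
        for strand in (s, reverse_complement(s)):
            for i in range(d, len(strand)):
                ctx, x = strand[i - d:i], strand[i]
                # Skip windows containing symbols other than A, C, G, T.
                if ctx in counts and x in counts[ctx]:
                    counts[ctx][x] += 1
    return {ctx: {x: c / sum(row.values()) for x, c in row.items()}
            for ctx, row in counts.items()}
```

Calling the function for d = 0, 1,…, 4 on the same negative sequences yields the family of background models whose order is compared in the simulations.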


ROC curves for the likelihood-based probabilistic method (red), traditional scanning (blue), and a probabilistic scanning-based method that outputs a probability of binding (green) for the case where promoter sequence lengths have not been made equal. Background model order is (a) d = 0 and (b) d = 1.


ROC curves for the likelihood-based method (blue) when combined with a single additional information source: regulatory potential (red), and evolutionary conservation (green). Solid graphs (resp. dashed graphs) correspond to the optimized parameters (resp. results obtained with stratified cross-validation).


ROC curves for the traditional scanning (green), traditional scanning combined with thresholded conservation information (blue), probabilistic method combined with conservation information (red), and probabilistic method (cyan) for the case where promoter sequence lengths have not been made equal.


ROC curves for the likelihood-based method (blue) when combined with a single additional information source: evolutionary conservation (green) and regulatory potential (red). Promoter sequences that are used to train the regulatory potential method and that also overlap with our test set have been removed.
