RS, EDS, and EvN conceived and designed the experiments. RS and EvN performed the experiments and analyzed the data. RS, EDS, and EvN contributed reagents/materials/analysis tools and wrote the paper.

The authors have declared that no competing interests exist.

A central problem in the bioinformatics of gene regulation is to find the binding sites of regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences with a model for the evolution of binding sites and "background" intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Markov-chain Monte Carlo sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species, PhyloGibbs significantly outperforms existing methods.

Computational discovery of regulatory sites in intergenic DNA is one of the central problems in bioinformatics. Until recently, motif finders typically took one of two general approaches. Given a known set of co-regulated genes, one searches their promoter regions for significantly overrepresented sequence motifs. Alternatively, in a "phylogenetic footprinting" approach, one searches multiple alignments of orthologous intergenic regions for short segments that are significantly more conserved than expected based on the phylogeny of the species.

In this work the authors present an algorithm, PhyloGibbs, that combines these two approaches into one integrated Bayesian framework. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors can be assigned to arbitrary collections of multiple sequence alignments while taking into account the phylogenetic relations between the sequences.

The authors perform a number of tests on synthetic data and real data from Saccharomyces genomes in which PhyloGibbs significantly outperforms other existing methods. Finally, a novel anneal-and-track strategy allows PhyloGibbs to make accurate estimates of the reliability of its predictions.

Transcription factors (TFs) are proteins that bind in a sequence-specific manner to short DNA segments ("binding sites"), most commonly in intergenic DNA upstream of a gene, to activate or suppress gene transcription. Their DNA-binding domains recognize collections of short related DNA sequences ("motifs"). One generally finds that, although there is no unique combination of bases that is shared by all binding sites, and although different bases can occur at each position, there are clear biases in the distribution of bases that occur at each position of the binding sites. A common mathematical representation of a motif that takes this variability into account is a so-called weight matrix (WM), whose entries w_{αi} give the probabilities of finding base α ∈ {A, C, G, T} at position i of a binding site.
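For concreteness, the probability a WM assigns to a candidate site is simply the product of its per-position entries. A minimal sketch in Python (the matrix below is hypothetical, not taken from any real TF):

```python
import math

# Hypothetical weight matrix for a motif of width 4: wm[i][base] is the
# probability of observing `base` at position i of a binding site.
wm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

def site_probability(site, wm):
    """Probability of a sequence segment under the WM model: positions
    are treated as independent, so the per-position entries multiply."""
    return math.prod(wm[i][base] for i, base in enumerate(site))

# A segment matching the consensus AGCT scores far higher than a mismatch.
p_consensus = site_probability("AGCT", wm)
p_mismatch = site_probability("AAAA", wm)
```

A fuzzy motif corresponds to a WM whose columns are less sharply peaked, shrinking the gap between consensus and off-consensus probabilities.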

There are several algorithms that are based on the WM representation that detect, ab initio, binding sites for a common TF in a collection of DNA sequences [

A crucial factor for the success of ab initio methods is the ratio of the number of binding sites to the total amount of DNA in the collection of sequences. That is, the larger the number of binding sites in the set, and the smaller the total amount of DNA, the more likely it is that ab initio methods can discover the binding sites among the other DNA sequences. In order to ensure a reasonable chance of success one thus needs to provide these methods with collections of sequences that are highly enriched with binding sites for a common TF. One possibility is to use sets of upstream regions from genes that appear co-regulated in microarray experiments (e.g., [

This latter approach is in general complicated by a number of factors. When searching for regulatory sites in sequences that are not phylogenetically related, such as upstream regions of different genes from the same organism, one may simply look for short sequence motifs that are overrepresented among the input sequences. If the set of species from which the orthologous sequences derive are sufficiently diverged, one may simply choose to ignore the phylogenetic relationship between the sequences and treat the orthologous sequences in the same way as sequences that are not phylogenetically related. This was, for instance, the approach taken by McCue et al. [

However, this approach is not applicable to datasets containing more closely related species, where some of the sequences will exhibit significant amounts of similarity simply because of their evolutionary proximity. Moreover, the amount of similarity will depend on the phylogenetic distance between the species, and it is clear that finding conserved sequence motifs between orthologous sequences from closely related species is much less indicative of function than finding sequence motifs that are conserved between distant species. One will in general thus have to distinguish conservation due to functional constraints from conservation due to evolutionary proximity, and to do this correctly, the phylogenetic relationship between the sequences has to be taken into account.

A second challenge in using orthologous intergenic sequences from multiple species is the nontrivial structure of their multiple alignments. One typically finds a very heterogeneous pattern of conservation: well-conserved blocks of different sizes and covering different subsets of the species are interspersed with sequence segments that show little similarity with the sequences of the other species.

The technique of “phylogenetic footprinting” (e.g., [

We thus decided to retain the entire patchwork pattern of conserved sequence blocks and unaligned segments. Our strategy is implemented by a Gibbs sampling approach, and a preliminary account of the algorithm was presented in [

Recently a number of other algorithms have been developed that search for regulatory motifs in groups of phylogenetically related sequences. Probably the first algorithm that was proposed is a generalization of the Consensus algorithm [

More closely related to PhyloGibbs's approach are two recent algorithms [

In the following sections, we first describe our Bayesian model that assigns a posterior probability to each configuration of binding sites for multiple motifs assigned to the input sequences. We start by describing the model for phylogenetically unrelated sequences, which is essentially equivalent to the model used in the Gibbs motif sampler [

We then present examples of the performance of our algorithm and others on both synthetic and real data. The synthetic datasets consist of mixtures of WM samples and random sequences, which is in accordance with the assumptions that all algorithms make. This allows us to compare the performance of the algorithms in an idealized situation that does not contain the complexities of real data. These tests also show to what extent binding sites can be recovered from this idealized data as a function of the quality of the WMs, the number of sites available, and the number of species available and their phylogenetic distances. For our tests on real data we use 200 upstream regions from Saccharomyces cerevisiae together with orthologous regions from related Saccharomyces species.

In order to motivate and explain our model for phylogenetically related sequences it is helpful to first introduce the model for sequences that are not phylogenetically related. In this context, “not phylogenetically related” means that for any pair of sequences in the input data, their common ancestor sequence is sufficiently far in the evolutionary past that mutations have been introduced multiple times at each position in the sequences. That is, any similarity left between the input sequences cannot be due to evolutionary proximity.

We assume that our data contain an unknown number of sites for an unknown number of different TFs. The state space of possible solutions to the problem of identifying the binding sites contained in these sequences consists of all possible ways in which one can assign groups of binding sites to these sequences. An example of such binding site assignments, which we call “configurations,” is shown in

A window, in our terminology, is a possible binding site for a TF; in the case of phylogenetically unrelated sequences it is simply a stretch of contiguous bases of the assumed motif width.

Assuming that the width of the binding sites is

Given a dataset of input sequences S, the posterior probability of a binding site configuration C follows from Bayes' theorem,

P(C|S) ∝ P(S|C) P(C),

where P(C) is a prior probability over configurations and P(S|C) is the likelihood of the data given the configuration. The likelihood factorizes as

P(S|C) = P_bg(S_0) ∏_{c≥1} P(S_c),

where we denote by S_0 all sequence that is colored zero (background) in the configuration C, and by S_c the set of windows that are colored c.

When considering datasets that contain phylogenetically related sequences, such as orthologous intergenic regions from related species, the main problem is distinguishing sequence similarity that is due to evolutionary proximity from sequence similarity that arises from functional constraints. That is, when calculating the probability

Our strategy is thus to first produce a multiple alignment and then search the space of binding site configurations that are consistent with this alignment. Standard global multiple alignment algorithms [

Once we have a syntenic multiple local alignment, we treat columns of aligned bases as phylogenetically related, i.e., arising from a common ancestor base. The state space again consists of all possible configurations of binding sites but now with the constraint that “windows” that include aligned bases have to extend over all sequences in the alignment. That is, we assume that if a binding site occurs in a sequence segment that is aligned with sequence segments from the other species, then binding sites for the same TF have to occur in the corresponding positions of these other sequence segments. To this end we extend the concept “window” (denoting a position of a potential binding site) to multiple local alignments, as illustrated in

Vertically aligned capital letters are phylogenetically related bases, assumed to have evolved from a common ancestor. Thus, any window placed on these bases is extended to cover all related bases. Three legitimate windows are surrounded by solid boxes. The window surrounded by the dotted box is illegitimate because the gap in the top sequence makes the alignment of bases inconsistent. Note that lower case letters are not aligned and that, in order to complete a window with aligned sequences, one may slide lowercase bases “through” adjoining gaps. For example, if the window on the bottom two sequences were to move two steps to the left, the “c” and “a” on the left side of the preceding gaps would slide through the gaps to the right to complete the window.

The figure shows a sample stretch of four aligned sequences, where uppercase letters are aligned and lowercase letters are “independent.” In an initial pass the program identifies the set of all legitimate “windows” in the entire sequence data. Each of these windows may encompass one or more sequences. The windows must contain consistently aligned uppercase letters: there should not be “gaps” that give inconsistent spacing between aligned uppercase letters. For example, in
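The window-legitimacy check described here can be sketched as follows; this simplified version only rejects windows whose column span contains a gap in a covered row, and omits the sliding of unaligned lowercase bases through adjoining gaps that the full algorithm performs:

```python
def legitimate_window(rows, start, width):
    """Reject a multi-sequence window if any covered row has a gap
    character inside the window's column span, since such a gap makes the
    column-wise pairing of aligned bases inconsistent.  (Simplified
    sketch: the full procedure also slides unaligned lowercase bases
    through adjoining gaps to complete windows.)"""
    return all("-" not in row[start:start + width] for row in rows)

# Toy alignment: the gap in row 2 invalidates windows spanning column 3.
aligned = ["ACGTACGT",
           "ACG-ACGT",
           "ACGTACGT"]
```

In an initial pass one would apply such a check to every candidate position to enumerate the set of legitimate windows once, before any sampling begins.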

Next, we need to generalize our probabilistic model to multiply aligned orthologous sequences. For the single-sequence windows of the previous section, the probability of a single position of the window is simply the corresponding WM entry,

with s_i the base at position i of the window and w_{αi} the WM entry for base α at position i. Since the positions are mutually independent we have for the whole window P(s|w) = ∏_{i=1}^{l} w_{s_i i}.

The probability P(S_c) of a set of aligned windows colored c is obtained analogously, with each aligned column scored under the evolutionary model using the corresponding WM column,

and the probability P_bg(S_0) of the background sequence S_0 is computed in the same way, except that the WM entries w_{αi} are replaced with the background probabilities for the bases in each column. Detailed derivations and explicit expressions are provided in Materials and Methods.
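To illustrate the kind of evolutionary scoring involved, the sketch below scores a single aligned column under a simple star-phylogeny substitution model in which each descendant retains the ancestral base with some "proximity" q and otherwise draws an independent base from a given distribution (the WM column inside a site, the background outside). The uniform distribution and the per-species q values used here are illustrative assumptions, not the paper's exact parameterization:

```python
def column_probability(column, probs, proximities):
    """Probability of one aligned column under a star-phylogeny model:
    each descendant keeps the (unknown) ancestral base with probability q
    and otherwise draws an independent base from `probs`.  The ancestral
    base is summed out, weighted by `probs`."""
    total = 0.0
    for ancestor, p_anc in probs.items():
        likelihood = p_anc
        for base, q in zip(column, proximities):
            likelihood *= q * (base == ancestor) + (1 - q) * probs[base]
        total += likelihood
    return total

uniform = {b: 0.25 for b in "ACGT"}
p_conserved = column_probability("AAA", uniform, [0.9, 0.9, 0.9])
p_discordant = column_probability("ACG", uniform, [0.9, 0.9, 0.9])
```

Note that a perfectly conserved column is far more probable than a discordant one when the species are close (q near 1), which is exactly how the model separates conservation from chance similarity.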

The last two sections have explained the posterior probability

The most important of the moves in our move set is the “window shift” move, which takes a single window and resamples its position. Since this type of move is generally referred to as Gibbs sampling, i.e., one samples a joint probability distribution by resampling one variable of the joint distribution at each time step while keeping the other variables fixed, and because of the similarities with the original Gibbs motif sampling algorithm [

By repeating moves from the move set described in the previous section, PhyloGibbs will, in the limit of long time, sample each configuration C with a frequency proportional to its posterior probability P(C|S).

Our current strategy is to first search for the configuration C* that maximizes the posterior probability by sampling configurations in proportion to P(C|S)^{β}. Initially we set β = 1 and slowly increase β with time until the sampler "freezes" into a configuration C*.
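The anneal phase can be illustrated with a generic Metropolis sketch that samples in proportion to exp(β · log-score) while raising β; the geometric schedule and all parameter values below are illustrative assumptions, not PhyloGibbs's actual move set:

```python
import math
import random

def anneal(initial, propose, log_score, steps=5000, beta0=0.1, beta_rate=1.001):
    """Search for a high-scoring configuration by Metropolis sampling from
    a distribution proportional to exp(beta * log_score), slowly raising
    the inverse temperature beta so the chain 'freezes' near an optimum."""
    state = best = initial
    score = best_score = log_score(initial)
    beta = beta0
    for _ in range(steps):
        cand = propose(state)
        cand_score = log_score(cand)
        # Accept improvements always; accept worse moves with a
        # probability that shrinks as beta grows.
        if cand_score >= score or random.random() < math.exp(beta * (cand_score - score)):
            state, score = cand, cand_score
            if score > best_score:
                best, best_score = state, score
        beta *= beta_rate  # geometric annealing schedule (an assumption)
    return best
```

With a toy score such as -(x - 5)², proposals of ±1 steps quickly freeze onto the optimum x = 5.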

The annealing phase is followed by a tracking phase in which we sample from the distribution P(C|S) and track how often the windows of the reference configuration C* recur in the sampled configurations.

In its default mode of operation PhyloGibbs reports the following results for the input set of sequences: (1) the reference configuration C* found during annealing, (2) the posterior probability of each of its windows as estimated during tracking, (3) all binding sites with posterior probabilities above p_{min}, with p_{min} a cutoff that the user can specify, and (4) a WM for each color
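Estimating a window's posterior from the tracking phase amounts to counting how often that window recurs among sampled configurations; a minimal sketch:

```python
from collections import Counter

def track_posteriors(sampled_configs):
    """Given configurations sampled from the posterior, estimate the
    posterior probability of each window as the fraction of samples in
    which that window appears.  Each configuration is a set of windows."""
    counts = Counter()
    for config in sampled_configs:
        counts.update(set(config))
    n = len(sampled_configs)
    return {window: c / n for window, c in counts.items()}
```

A window present in 80 of 100 samples thus gets posterior 0.8, and windows below a user cutoff p_min can simply be filtered from the report.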

Note that, in general, different member windows

As far as we are aware, PhyloGibbs is the only motif-finding algorithm that rigorously assigns posterior probabilities to the binding sites that it reports.

As we show below, only the rigorous sampling of the space of all configurations, as implemented in PhyloGibbs, is capable of assigning realistic posterior probabilities to the sites it reports. One sometimes attempts to identify "significant" motifs by simply rerunning one or several different motif-finding algorithms and looking for recurring motifs. However, this merely generates the subsidiary problem of clustering the multiple predictions. Instead of using ad hoc scoring schemes for clustering, reported binding sites should ideally be clustered using the same probabilistic scoring that generated them, i.e., as in [

In general there are three qualitatively different issues that contribute to the performance of motif-finding algorithms on real data. First, all motif-finding algorithms make assumptions about the data that will ignore at least some of the complexities of real data. The performance of a given motif-finding algorithm will depend on the extent to which these ignored complexities affect the algorithm's ability to perform its task. Second, the search spaces of all possible WMs or all possible binding site configurations are too large to search exhaustively, and therefore all algorithms employ heuristic methods to search for the globally optimal WMs or configurations. The extent to which the heuristic methods succeed or fail will also affect the performance of the algorithms. Third, even if the data adhere to all assumptions that an algorithm makes, and the algorithm successfully finds the global optimum in the search space, this still does not guarantee that the algorithm will recover the correct motifs and sites. That is, if the motifs are fuzzy and the sites are embedded in long background sequences it might occur that, by chance, the background contains sets of sites that are more conserved and more similar than the embedded sites. In this case it will be impossible for any algorithm to recover the true sites.

By generating synthetic data to accord, as much as possible, with the assumptions that the motif-finding algorithms make, we can study the second and third issues separately from the first issue. In this section and the next we do a number of such tests. In our first test we want to evaluate to what extent PhyloGibbs can recover a fixed number of sites embedded in a perfect alignment of orthologous sequences as the quality of the WMs and the phylogenetic distances of the orthologs are varied. At the same time, we want to test how well PhyloGibbs performs when operating on perfect alignments compared to algorithms that do not take phylogenetic information into account and that cannot operate on multiple alignments (including PhyloGibbs in the mode where it ignores phylogenetic information). This test will indicate how much performance can be improved by using phylogenetic information and multiple alignments in an ideal situation. For ease of reference, from now on we refer to all algorithms that use phylogenetic information and that can operate on multiple alignments as “phylo” algorithms, while referring to algorithms that treat all sequences as independent as “non-phylo” algorithms.

For our first test we generated synthetic datasets as follows. (1) We first generated a WM of width l by drawing, for each position, a vector of base probabilities (w_a, w_c, w_g, w_t)
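A simplified generator for such synthetic data might look as follows; the "polarization" parameter p (the weight given to the consensus base at each position) and all numeric values are illustrative assumptions:

```python
import random

BASES = "ACGT"

def polarized_wm(width, p, rng):
    """Hypothetical generator for a WM of given width and polarization p:
    at each position one randomly chosen consensus base gets probability p
    and the other three bases share 1 - p equally."""
    wm = []
    for _ in range(width):
        consensus = rng.choice(BASES)
        wm.append({b: p if b == consensus else (1 - p) / 3 for b in BASES})
    return wm

def sample_site(wm, rng):
    """Draw one binding site from the WM, position by position."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in wm)

rng = random.Random(42)
wm = polarized_wm(10, 0.85, rng)
sites = [sample_site(wm, rng) for _ in range(20)]
```

The sampled sites would then be embedded at random positions in random background sequence to build one synthetic dataset.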

We compared the performance of PhyloGibbs with those of non-phylo algorithms on alignments of five orthologous sequences.

PhyloGibbs with phylogeny (red), PhyloGibbs in non-phylo mode (light blue), WGibbs (dark blue), and MEME (pink) were run on alignments of five orthologous sequences.

All algorithms assume that the data are a mixture of random uncorrelated background sequences and samples from a number of WMs of certain lengths. With the exception of the phylogenetic relationship of the sequences, which is ignored by the non-phylo algorithms, the synthetic data are thus in complete accordance with the assumptions that each of the algorithms makes. For each algorithm we specified the correct length and number of sites. Since, when PhyloGibbs is run with phylogeny, the windows extend over all five sequences in the alignment, we asked PhyloGibbs to predict four multi-sequence windows for a single motif, while we asked the non-phylo algorithms to search for 20 single-sequence sites for a single motif. Since for any algorithm the performance differs substantially between input datasets that were generated with the same parameter settings, we averaged results over 50 datasets and in

All non-phylo algorithms, including PhyloGibbs when phylogeny is turned off, perform roughly equally well (or badly). For highly polarized WMs all non-phylo algorithms perform quite well. In contrast, for low polarizations (

It is important to point out that PhyloGibbs's superior performance for these data is partly due to the fact that the five sequences have been perfectly aligned and that it is searching only through configurations that are consistent with this alignment. In contrast, the non-phylo algorithms treat the five sequences as independent and have to search a much larger space of configurations. For real data we of course do not have perfect alignments and it will generally be hard to obtain good alignments when the proximity

In

It might also be asked if the non-phylo algorithms are put at a disadvantage by the fact that they have to search for a much larger number of sites. That is, to get a 50% performance PhyloGibbs needs only to get two multi-species sites correct, whereas the non-phylo algorithms need to get ten sites correct. To test this we ran MEME and WGibbs on single sequences, as opposed to groups of

Although PhyloGibbs performed consistently better than the non-phylo algorithms, in many cases it recovered only a fraction of the embedded sites. Since the synthetic data were generated exactly according to the model that PhyloGibbs assumes, there are only two possible reasons for the failure of PhyloGibbs to recover the embedded sites. The first possibility is that the correct configuration C_{cor}, i.e., with the four binding sites occurring at the positions where they were embedded, is the globally optimal binding site configuration, but that the anneal phase failed to identify it and instead settled on an only locally optimal configuration. In that case the posterior probability P(C_{cor}|S) of the correct configuration should be higher than the posterior probability P(C*|S) of the configuration that PhyloGibbs reported.

To investigate how often the anneal in PhyloGibbs identifies the globally optimal configuration, we compared the posterior probability P(C_{cor}|S) of the correct configuration with the posterior probability P(C*|S) of the reported configuration.

In conclusion, these first tests with synthetic data showed that, when sites from WMs are embedded in a random ancestor sequence, and PhyloGibbs is given a perfect alignment of a set of descendants of this sequence, it performs significantly better than algorithms that treat the descendants as independent sequences. It also shows that as the similarity between sites becomes less than or equal to the similarity between orthologous sequences due to evolutionary proximity, it becomes impossible for any algorithm to accurately recover the sites.

In the first test PhyloGibbs used both information from the overrepresentation of a motif in the data and information about the conservation of its sites. We next investigated how many species one would need, in an ideal situation, to reliably infer the location of a single binding site using conservation only. To test this we generated synthetic intergenic regions of length

The solid line shows the average overlap between the true site and the predicted site and the dotted lines show two standard errors.

We see that more than ten species are needed to have a 50% probability to recover a single site of a random WM at

In the next section we compare the performance of PhyloGibbs and other algorithms that use phylogeny (PhyME [

As mentioned in the previous section, our synthetic data are in accordance with all assumptions that the non-phylo algorithms make about the data, except of course for the phylogenetic relationships between the sequences. Our synthetic orthologous sequences are generated in accordance with the evolutionary model that PhyloGibbs and PhyME assume. For these two algorithms the synthetic data are thus in exact accordance with the assumptions that these algorithms make. EMnEM employs an evolutionary model that uses the same substitution matrix both within and outside of sites, but allows each position in a binding site to evolve at a different overall rate. This model is thus less realistic than the model that PhyME and PhyloGibbs use, in that it ignores the fact that the probabilities of different substitutions within a site depend on the site's WM. However, since it has more free parameters that can be fitted, we suspect that in practice it will be able to reasonably approximate the evolutionary model that PhyloGibbs and PhyME use.

We followed the estimates of [
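Evolving synthetic orthologs from a common ancestor along a star phylogeny can be sketched as follows; the uniform background distribution is an illustrative assumption (inside a binding site, the WM column would be used instead of the background):

```python
import random

BASES = "ACGT"

def evolve_star(ancestor, proximities, background, rng):
    """Evolve one ancestral sequence into descendants on a star phylogeny:
    each descendant keeps every ancestral base with its proximity q and
    otherwise draws a fresh base from the background distribution."""
    weights = [background[b] for b in BASES]
    descendants = []
    for q in proximities:
        seq = "".join(
            base if rng.random() < q else rng.choices(BASES, weights=weights)[0]
            for base in ancestor)
        descendants.append(seq)
    return descendants
```

With proximity q = 1 a descendant is an exact copy; with q = 0 it is fresh background, so q directly controls how much similarity is due to evolutionary proximity alone.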

A total of 250 alignments of

The algorithms that ignore phylogeny did not recover more than 16% of the true sites (sensitivity), and did so with a nearly fixed specificity of around 20%, meaning that there is not much enrichment in true sites in the top versus the bottom of their ranked lists. The algorithms that exploit the phylogeny all did better for the simple reason that they operate on the perfect multiple alignments and therefore their search space is much smaller. Of these three algorithms PhyloGibbs performed best. PhyME, in common with the non-phylo algorithms, reports a very limited range of posterior probabilities for the sites it reports, which leads to a relatively small “dynamic range” in sensitivity/specificity.

Note that even with these perfectly aligned data, to recover 50% of the true sites PhyloGibbs needed to make more than twice as many predictions. Again, to determine to what extent the failure of PhyloGibbs to recover all the embedded sites was caused by the anneal getting trapped in locally optimal configurations, we compared the posterior probabilities P(C_{cor}|S) and P(C*|S).

These synthetic data also provide the opportunity to test how well the algorithms assess their reliability, i.e., how well the reported posterior probabilities for their predictions match the specificities (fraction of predictions matching true sites) we compute by knowing the true sites. Ideally the two are the same, so that for real data one could use the posterior probabilities to gauge the fraction of correct predictions. The right panel of

Both EMnEM and PhyME overestimated their specificity because they calculate their posterior probabilities for sites under the assumption that the WMs that they infer are correct. In reality, the inferred WM will often not match the true WM that generated the data. For WGibbs the overestimation stems from the restricted sampling of configurations around the one that gave the maximum posterior probability during sampling. Only PhyloGibbs bases its posterior on sampling of the whole space of binding site configurations.
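The sensitivity/specificity curves used in these comparisons can be computed from a ranked list of predictions in the obvious way; a sketch:

```python
def sensitivity_specificity(ranked_predictions, true_sites):
    """Walk down a prediction list ranked by posterior probability and
    report, after each prefix, the fraction of true sites recovered
    (sensitivity) and the fraction of predictions matching true sites
    (specificity, in this paper's usage)."""
    found = set()
    curve = []
    for k, pred in enumerate(ranked_predictions, start=1):
        if pred in true_sites:
            found.add(pred)
        curve.append((len(found) / len(true_sites), len(found) / k))
    return curve
```

Comparing the specificity at each cutoff with the algorithm's own reported posterior at that cutoff is exactly the calibration test described here: for a well-calibrated sampler the two should coincide.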

In

In summary, these tests have shown that, given perfectly aligned input sequences, all phylo algorithms substantially outperform non-phylo algorithms. The tests have also shown that, on data that are in accordance with the assumptions that the phylo algorithms make (almost all for EMnEM), PhyloGibbs outperforms the other algorithms. In addition, only PhyloGibbs is capable of reasonably estimating the reliability of its own predictions.

To test the performance of PhyloGibbs and other algorithms on real data we decided to use data from the recently sequenced yeast genomes [

For each of the 200 genes, we gathered its upstream region together with the orthologous upstream regions from

The left panel of

The left panel shows how the fraction of predicted sites that match true sites (specificity) depends on the fraction of true sites that are among the predictions (sensitivity) for PhyloGibbs (red), EMnEM (yellow), PhyME (green), PhyloGibbs without phylogeny (light blue), WGibbs (dark blue), and MEME (pink). Dashed lines correspond to one standard error. In order for the specificities predicted by the various algorithms to match the true specificities, we have to assume that the known sites are only a fraction of all true sites. The right panel shows what the fraction of known sites among all true sites should be in order for the algorithms' predicted specificities to match the true specificities. The black line shows an independent estimate of the fraction of real sites in these upstream regions that is documented (see text).

We believe that one important factor contributing to the smaller difference between the phylo and non-phylo algorithms is the limited reliability of the multiple alignments. Since all phylo algorithms only sample configurations consistent with the alignment, any errors in the alignment will hurt their performance. Another factor that probably plays a role is that all phylo algorithms assume that when a site occurs in a conserved block, the site must occur in all species. This is probably not always true, i.e., there are cases where only some of the sequences in an aligned block have retained the site. The non-phylo algorithms can easily deal with this by placing windows only on those sequences that have retained the site, but the phylo algorithms cannot, and a block with several binding sites may be “spoiled” by a single sequence that is missing the site.

All specificities in the figure were calculated with respect to the documented sites. Assume that the n_d documented sites are only a fraction f = n_d/n_r of all n_r real sites in these upstream regions. If an algorithm predicts n_p sites, of which n_t match real sites, then the expected number of predictions matching documented sites is f n_t, so the measured specificity f n_t/n_p underestimates the true specificity n_t/n_p by the factor f.

Conversely, equating the specificity an algorithm itself predicts with the true specificity n_t/n_p yields an estimate of the fraction f of real sites that is documented: f is the ratio of the measured specificity to the predicted specificity.

All binding sites that PhyloGibbs predicted in the upstream regions of the genes with one or more sites in SCPD are listed in

In the previous sections PhyloGibbs inferred the locations of regulatory sites in one intergenic region at a time. Although sites for a given TF often occur in multiple copies in a single intergenic region, there are also many cases where only a single site occurs, and in those cases PhyloGibbs has to rely on conservation alone to infer the locations of the regulatory sites. However, PhyloGibbs is not limited to running on a single multiple alignment of orthologous intergenic regions; it can also run on a set of multiple alignments for co-regulated genes, which should significantly increase sensitivity and specificity.

To test the performance of PhyloGibbs in this setting we used data from a recently published [

We tested PhyloGibbs on the highest confidence set of intergenic regions regulated by each factor. We focused on the 45 TFs that had the fewest binding sites annotated in [

We first tested whether, in contrast to the motif-finding algorithms employed in [

Results of PhyloGibbs on Collections of Intergenic Regions for 21 TFs for Which the Motif-Finding Algorithms in [

We evaluated the results that PhyloGibbs reported for each TF in various ways. As described in

We see that for 16 of the 21 TFs, PhyloGibbs found a motif that matched, according to at least one statistic, the consensus motif known for this TF in the literature. PhyloGibbs thus apparently outperformed all of the motif-finding algorithms used in [

The results for PUT3 seem paradoxical. All sites PhyloGibbs reported matched sites reported in [

For MET31 the WMs matched reasonably well, and two out of three sites in configuration

For ADR1 and MAC1, both reported WMs showed a significant match to the literature motif but the reported sites overlapped only marginally with the sites reported in [

For HAP5 and SKO1, only the anneal WM matched the literature WM. Although a reasonable number of windows occurred on average during tracking for these motifs, there was no stable core: even the most stable window in each group was present only about 50% of the time. The membership of these groups thus fluctuated significantly during tracking, and this is reflected in the fact that the information score (see

For GZF3 and RLM1 there was only a moderate match of the anneal WM to the literature WM, and no overlap whatsoever of the reported sites with the sites reported in [

Finally, there were five TFs (DAL80, MOT3, ROX1, YAP6, and YOX1) for which PhyloGibbs did not find any motif matching the literature motif among the intergenic regions from [

Hongay et al. [

Linde and Steensma [

For YAP6 a consensus binding site has been established by in vitro studies of different YAP proteins binding DNA [

Finally, Pramila et al. [

Results of PhyloGibbs on Collections of Intergenic Regions for 24 TFs for Which the Motif-Finding Algorithms in [

The protein SPT2 is involved in the regulation of chromatin structure and is known to interact directly with the SWI/SNF complex and with histones. SPT2 has been reported not to have any sequence specificity in its DNA binding [

In summary, PhyloGibbs, when run on the highest quality intergenic regions and their orthologs reported in [

Detailed comparisons of PhyloGibbs's results with the annotations of [

Motif discovery algorithms make use of a variety of different kinds of information to identify binding sites for regulatory factors in intergenic DNA. Sequence specificities for particular regulatory factors can sometimes be obtained through detailed experimentation, including DNaseI footprinting and SELEX experiments. Weight matrices representing the sequence specificities can then be used to locate putative binding sites for these regulatory factors. In this respect algorithms often look for combinations of binding sites for several WMs [

In this paper we have presented a novel algorithm for ab initio discovery of regulatory sites that combines the search for overrepresented motifs with the analysis of sequence conservation in arbitrary collections of sequences and their orthologs. A major challenge in using orthologous sequences is distinguishing conservation due to functional constraints, such as regulatory sites, from conservation simply due to evolutionary proximity. In order to do this correctly one has to determine which sequence segments have evolved from a common ancestral segment, i.e., the sequences have to be aligned, and their phylogenetic relationships have to be taken into account. This is complicated by the fact that orthologous intergenic sequences typically cannot be trivially aligned but show a complex pattern of conserved blocks interspersed with unalignable segments. Moreover, regulatory sites are not necessarily restricted to the conserved blocks.

Focusing only on the conserved blocks, as is done in phylogenetic footprinting approaches [

Recently, two algorithms [

Another difference, also related to the prior

An important novel feature of our algorithm is the anneal-and-track strategy. The algorithm first uses simulated annealing to search for the configuration

In some approaches multiple runs of one or more algorithms on the same data are used to assess motif significance. However, in order to assess which motifs recur in multiple runs, results from the different runs have to be clustered, and the only way to do this correctly is to use the same sampling method as was used to extract the motifs in the first place. Our tracking strategy circumvents the need for such post-processing of the results.

Our tests with synthetic data showed that, in the idealized situation where orthologous sequences are perfectly aligned, algorithms that take phylogeny into account significantly outperform those that do not (see

We used intergenic regions of

We also ran PhyloGibbs on collections of intergenic region alignments of genes that were annotated in [

There are several issues that we intend to address in future extensions of the algorithm. First of all, we intend to extend the types and specificity of the priors that we allow. For example, when running on multiple alignments of several different upstream regions, one may sometimes have prior information that

The most useful quantity characterizing the “quality” of a WM w is its information score I, defined as

I = Σ_i Σ_α w_{αi} log(w_{αi}/b_α),

where b_α is the background probability of base α, and the logarithm is often calculated base 2 to express the information score in bits. Many relevant quantities regarding sets of binding sites can be expressed in terms of information scores. For instance, the fraction of random sequences of length l that match a WM with entries w_{αi} is approximately 2^{−I}, and the fraction that match a set of n such sites is approximately 2^{−nI}.
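As a concrete illustration, the information score can be computed directly from a WM and background base frequencies. A minimal sketch; the dict-based representation is an assumption of this example, not PhyloGibbs's internal data format:

```python
import math

def information_score(wm, bg):
    """Information score I = sum_i sum_alpha w[i][a] * log2(w[i][a] / bg[a]).

    wm: list of per-position dicts mapping base -> probability.
    bg: dict mapping base -> background probability.
    """
    score = 0.0
    for column in wm:
        for base, w in column.items():
            if w > 0.0:  # by convention 0 * log(0) = 0
                score += w * math.log2(w / bg[base])
    return score

uniform_bg = {b: 0.25 for b in "ACGT"}
# A perfectly conserved 3-column motif under a uniform background
# contributes log2(4) = 2 bits per column, i.e. 6 bits in total.
perfect = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 3
print(information_score(perfect, uniform_bg))  # -> 6.0
```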

The simplest prior over configurations, representing “complete ignorance,” is the uniform prior,

with

The probability P(S_c | w_c) of the set S_c of sites with color c, given the WM w_c for that color, is

with n_{αi} being the number of times that base α occurs at position i among the sites of color c.

where w_{αi} ≥ 0 and Σ_α w_{αi} = 1 for each position i,

where the γ parameter, which is generally referred to as a pseudocount, can be set by the user (default is γ = 1). With this prior the integral can be done exactly, and we obtain

with
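The exact integral over WM space with a symmetric Dirichlet prior has a standard closed form in terms of gamma functions. The sketch below computes the per-column factor under that assumption (the full score is a product of such factors over positions; function name and count representation are illustrative):

```python
from math import exp, lgamma

def column_evidence(counts, gamma=1.0):
    """Marginal likelihood of one WM column's base counts, integrating the
    multinomial likelihood against a symmetric Dirichlet(gamma) prior:

        P(n) = Gamma(4*gamma)/Gamma(N + 4*gamma)
               * prod_a Gamma(n_a + gamma)/Gamma(gamma)

    counts: dict base -> observed count n_a at this position.
    """
    n_total = sum(counts.values())
    log_p = lgamma(4 * gamma) - lgamma(n_total + 4 * gamma)
    for base in "ACGT":
        log_p += lgamma(counts.get(base, 0) + gamma) - lgamma(gamma)
    return exp(log_p)

# A single observed base with gamma = 1 gives 1/4, as it should:
print(column_evidence({"A": 1}))  # -> 0.25
```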

We will assume that the background sequence was generated by a Markov model of order k, i.e., the probability of finding base α_i at position i depends on the preceding bases α_{i−1}…α_{i−k}. We estimate the probabilities P(α_i | α_{i−1}…α_{i−k}) from the counts of (k+1)-mers in the data.

where n(α_{i−k}…α_i) denotes the number of occurrences of the (k+1)-mer α_{i−k}…α_i in the background data,

where the product is over all positions
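An order-k Markov background model of this kind can be estimated from counts of (k+1)-mers. A minimal sketch, assuming a pseudocount of 1 for smoothing and ignoring the first k bases when scoring (the estimator actually used by PhyloGibbs may differ in these details):

```python
from collections import defaultdict

def train_markov(sequences, k=1):
    """Estimate an order-k Markov background model from training sequences.

    Returns conditional probabilities P(base | preceding k bases), using a
    pseudocount of 1 per base for unseen contexts (a sketch assumption).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    model = {}
    for context, ctx_counts in counts.items():
        total = sum(ctx_counts.values()) + 4  # pseudocount of 1 per base
        model[context] = {b: (ctx_counts[b] + 1) / total for b in "ACGT"}
    return model

def background_probability(seq, model, k=1):
    """P(sequence | background): product of conditional probabilities over
    all positions i > k - 1; unseen contexts fall back to uniform."""
    p = 1.0
    uniform = {b: 0.25 for b in "ACGT"}
    for i in range(k, len(seq)):
        p *= model.get(seq[i - k:i], uniform)[seq[i]]
    return p
```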

It is conceptually and computationally convenient to divide the probabilities P(C) of configurations C by the probability P(C_0) of the reference configuration C_0 in which all windows are color zero (i.e., background). The factor

For each color _{αi}

At the start of each run, PhyloGibbs determines the set of all legitimate windows in the data. That is, it finds all locations where a window of length

Our model for the evolution of binding sites assumes that all bases mutate at a constant rate γ. When a base at position i of a site mutates, it is replaced by a base drawn from the WM column distribution w_{αi}. Under this simple model, the probability that a base at position i is α at time t, given that it was β at time zero, is

P(α, t | β) = e^{−γt} δ_{αβ} + (1 − e^{−γt}) w_{αi},

where we have introduced the “proximity” q = e^{−γt}: the probability that no mutation took place at this position during time t.

Note that as t → ∞ the proximity q goes to zero, and this probability reduces to the WM probability w_{αi} of observing an independent base α at position i.
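Under this evolution model the per-base substitution probability takes a simple two-term form: with probability q the ancestral base is retained, otherwise a base is redrawn from the WM column distribution. A sketch (function names are illustrative):

```python
import math

def proximity(rate, t):
    """Proximity q = exp(-rate * t): the probability that no mutation
    occurred at a position during time t."""
    return math.exp(-rate * t)

def descendant_prob(alpha, beta, q, w):
    """Probability of observing base alpha in a descendant, given ancestor
    base beta: with probability q the base is unchanged, and with
    probability (1 - q) it is redrawn from the WM column distribution w."""
    return q * (1.0 if alpha == beta else 0.0) + (1.0 - q) * w[alpha]
```

Note that the probabilities sum to one over the descendant base, and reduce to the WM probabilities as q goes to zero.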

To calculate the probability _{i}_{i}_{i}_{i}_{i}_{i}_{i}

For a star topology, the probability of the bases x_j observed at column i in the species j is

P(s_i) = Σ_α w_{αi} Π_j [ q_j δ_{α,x_j} + (1 − q_j) w_{x_j i} ],

where q_j is the proximity of species j to the common ancestor, and the sum over α weighted by w_{αi} corresponds to the possibility that the ancestor had base α at position i.
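For a star topology this is a single sum over the unknown ancestral base. A direct sketch of that sum, assuming the two-term substitution model q δ + (1 − q) w described above:

```python
def star_column_likelihood(observed, proximities, w):
    """Likelihood of one alignment column under a star phylogeny:
    sum over the unknown ancestral base alpha (drawn from WM column w) of
    the product over species j of q_j*delta(x_j, alpha) + (1-q_j)*w[x_j].

    observed:    list of bases x_j, one per species.
    proximities: list of q_j, one per species.
    w:           dict base -> WM probability for this column.
    """
    total = 0.0
    for alpha in "ACGT":
        prod = w[alpha]
        for x, q in zip(observed, proximities):
            prod *= q * (1.0 if x == alpha else 0.0) + (1.0 - q) * w[x]
        total += prod
    return total
```

With all proximities zero the species are independent and the likelihood factorizes into a product of WM probabilities; with all proximities one, only identical observed bases have nonzero likelihood.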

The expression _{i}_{i}_{αi}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}

For each aligned window, we approximate the window column function _{i}_{i}

(in this section we drop the position index i for notational convenience).

Therefore, using this form we can easily calculate the integrals for an arbitrary number of windows.

We now choose to set the parameters λ_α such that the first moments of the distribution are matched exactly,

where

for all α. This fixes the ratios λ_α/λ but leaves the overall scale λ still free. We use λ to approximate the second moments. We thus demand that the exact second moments

match the second moments of the approximation

as “closely” as possible. This could, for instance, be done by choosing λ such that the square-deviation is minimized. In the current implementation we set λ by, for every combination of α and β, solving for λ from the equation ⟨w_α w_β⟩_e = ⟨w_α w_β⟩_a.

We then set λ equal to the average of the λs that are obtained from this equation for the 16 combinations of α and β.

In calculating the parameters λ_{α} and

Finally, it is clear that by demanding that we approximate the function _{α} and _{α} and

The above method of treating a star-topology phylogeny can be readily extended to deal with more general situations. A completely general phylogeny (assuming no “lateral transfer” of DNA) can be represented as a tree; the root is the last common ancestor of all given species, nodes are intermediate ancestors (last common ancestor of some, but not all, given species), and the leaves are the actual species under consideration. All unknown ancestors (root and nodes) are separately summed over. Proximities

Consider such a phylogenetic tree that does not have a star topology (i.e., contains internal nodes other than the root). At least one of the intermediate nodes must be such that all its children are leaves. Let the unknown base for this node at column i be β. The node contributes a factor, involving a sum over β with proximity q_β, to the total probability, given by

where the full expression would contain other factors involving α as well as a sum over α; q_β is the proximity of β to its immediate ancestor α, the product runs over children of β indexed by n, with proximities q_n and observed bases x_n.

Substituting this into equation 28, we get two terms:

The first term simply removes the node β and attaches all its children to α (with unchanged proximities). The second term—identical, apart from a prefactor, to equation 19—can be treated as an independent factor to anything it multiplies, completely decoupled from the sums over α and other ancestors. In other words, with probability q_β, base β is the same as α and all its leaves can be attached directly to α, and with probability 1 − q_β, base β is mutated from α and can be treated as a new, independent ancestor for all of its descendants, disconnected from the rest of the tree.

By repeating this process, one can reduce any tree to a sum of products of star-phylogeny subtrees with appropriate prefactors. PhyloGibbs then applies the monomial approximation described in the previous section to each of the star-phylogeny subtrees, as well as to the final sum. Note, however, that the number of terms involved may grow exponentially with the number of species. As the number of species becomes large we thus need to make additional approximations to make this procedure computationally feasible.
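The same column likelihood on a general tree can also be obtained by recursively summing over the unknown bases at all internal nodes, which reproduces, term by term, the sum of star-subtree products described above. A minimal sketch; the tuple-based tree encoding and function names are assumptions of this example:

```python
BASES = "ACGT"

def subtree_prob(node, w):
    """Return {parent_base: P(observed leaves below node | parent base)}.

    A node is either ("leaf", observed_base, q) or ("node", q, children),
    where q is the proximity to the node's immediate ancestor. Each branch
    follows the two-term model q*delta + (1 - q)*w.
    """
    if node[0] == "leaf":
        _, x, q = node
        return {a: q * (1.0 if x == a else 0.0) + (1.0 - q) * w[x]
                for a in BASES}
    _, q, children = node
    tables = [subtree_prob(c, w) for c in children]
    result = {}
    for a in BASES:
        total = 0.0
        for b in BASES:  # sum over the node's own unknown base b
            p = q * (1.0 if b == a else 0.0) + (1.0 - q) * w[b]
            for t in tables:
                p *= t[b]
            total += p
        result[a] = total
    return result

def column_likelihood(root_children, w):
    """Full column likelihood: the root (common ancestor) base is drawn
    directly from the WM column distribution w."""
    tables = [subtree_prob(c, w) for c in root_children]
    total = 0.0
    for a in BASES:
        p = w[a]
        for t in tables:
            p *= t[a]
        total += p
    return total
```

For a tree whose root children are all leaves this reduces to the star-topology sum; note that this direct recursion is exponential in the number of internal nodes only through the nesting depth, while the algebraic reduction in the text trades this for an exponential number of star-subtree terms.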

A single time step of the algorithm consists of a “cycle” of a fixed number of moves of each of the types outlined in the following paragraphs.

Window-shift moves preserve the total number of colors, and the total number of colored windows (but may redistribute the windows among existing colors). We choose one of the presently colored windows at random. If it is the only one in its color, we make no operation (but to ensure detailed balance we update the time counter by one). If it is not the only window in its color, we color it zero (i.e., deselect it), and choose a new window from all of the available color-zero windows (including the window we selected) to replace it. The new window can have any of the existing colors, not necessarily the same as the window it is replacing. This move is computationally expensive, since if there are

Color-change moves allow for changes in the number of windows and the number of colors, while satisfying detailed balance. We select any of the existing windows, including color-zero windows. If the chosen window overlaps a non-zero-colored window then this window is blocked and we make no operation (but update the time counter). Otherwise, we reassign a color to the window, which may be zero, one of the existing colors, or a new color. Note that if the window was the only one in its color, a “new color” means “the same color as before.” The window-shift moves are not ergodic by themselves because they stay inside a subspace of fixed

With the previous two moves it is possible for the sampler to get stuck in a local optimum where the windows in a given color are all shifted by an equal amount from their best location. The global shift move addresses this problem. This move picks a color at random, and samples all ways of coherently shifting every window in that color by a fixed amount without “colliding” with an already-colored window.

Maskbit-flip moves are the final move type. Long motifs tend to be fuzzy, and not every position is sharply defined. Sometimes, the score of a collection of sites can be improved by scoring a subset of its columns according to the background model rather than assuming they derive from a WM. We thus allow the “masking” of certain columns, comparing whether or not the overall score is improved by scoring them according to background. For each color we maintain a mask, and sample over the states (zero or one) of the mask bits. In our experience, allowing such masking can increase performance for long motifs that contain nonconstrained sequence, such as occur in bacteria when TFs bind as dimers. However, for short motifs the enlargement of the configuration space that is associated with these masks may result in poorer discrimination.
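A window-shift move can be sketched as a heat-bath resampling step. The sketch below is illustrative only, not PhyloGibbs's implementation: it keeps the deselected window's original color, omits the only-window-in-its-color no-op described above, and `score_fn` is a hypothetical scoring callback:

```python
import math
import random

def window_shift_move(colored, free, score_fn, rng=random):
    """One simplified window-shift move: deselect a randomly chosen colored
    window and resample its replacement from the free (color-zero) windows
    plus itself, with probability proportional to exp(score).

    colored: dict mapping window -> color; free: list of color-zero windows.
    score_fn(window, color) is a hypothetical scoring callback.
    """
    window = rng.choice(sorted(colored))
    color = colored.pop(window)
    candidates = list(free) + [window]
    weights = [math.exp(score_fn(c, color)) for c in candidates]
    # Heat-bath sampling: pick a candidate proportional to its weight.
    r = rng.random() * sum(weights)
    for cand, wgt in zip(candidates, weights):
        r -= wgt
        if r <= 0.0:
            break
    colored[cand] = color
    if cand != window:
        free.remove(cand)
        free.append(window)
    return colored, free
```

Because the deselected window is itself among the candidates and candidates are drawn from the full conditional distribution, a step of this form satisfies detailed balance.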

After each cycle during the tracking phase, the best-matching color

After shifting all the windows in

Note that this corresponds to the maximal amount of overlap between the sites in _{min} sorted from large to small

For

As a measure of performance we took the fraction of all the bases in real sites that overlapped predicted sites and averaged it over all datasets for each parameter setting. This “overlap” thus runs from zero to one. For each parameter setting the standard error of the overlap is given by

where _{i}
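Computing the mean overlap and its standard error across datasets can be sketched as follows, assuming the standard error of the mean is intended (o_i denoting the overlap of dataset i and n the number of datasets):

```python
import math

def overlap_stats(overlaps):
    """Mean overlap across datasets and its standard error:

        stderr = sqrt( sum_i (o_i - mean)^2 / (n * (n - 1)) )
    """
    n = len(overlaps)
    mean = sum(overlaps) / n
    var_term = sum((o - mean) ** 2 for o in overlaps)
    stderr = math.sqrt(var_term / (n * (n - 1)))
    return mean, stderr
```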

For

The data for

To produce the left panel of _{i}_{i}_{i}_{0} all predicted sites were included in _{0}, at _{1} all but the last 100 sites were included in _{1}, and generally _{i}_{i}

and the standard error we similarly estimate as

This estimate of standard error correctly takes into account the fact that as the number of predictions

For the right panel of _{i}_{i}_{i}_{i}_{i}

We “cleaned up” the dataset of experimentally documented binding sites from the SCPD [

The upstream regions of the 200

We aligned each set of orthologous intergenic regions with Dialign [

For PhyloGibbs, EMnEM, and PhyME we needed to specify the phylogeny of the sensu stricto species. The reported conservation rates are q_{cer,par} = 0.74, q_{cer,mik} = 0.6, and q_{cer,bay} = 0.52. In the approximation of a star topology, the conservation rate q_{i,j} for a pair of species i and j is determined by the proximities q_i and q_j of the two species to their common ancestor.

Assuming that q_{cer} = q_{par} we obtain q_{cer} = q_{par} = 0.8, q_{mik} = 0.58, and q_{bay} = 0.45. No conservation rate was reported in [, but since q_{kud} should lie between those of its neighbors in the tree, we set q_{kud} = 0.5. PhyloGibbs, PhyME, and EMnEM were all run with this phylogenetic tree. EMnEM requires branch lengths in terms of the number of substitutions per site, and we used the relation q = e^{−n} to determine the number of substitutions n from the proximity q.

For reference we again give the command lines that we used in running the algorithms on the 200 genes with documented sites. For PhyloGibbs with phylogeny we used -D 1 -T 0.35 -m 10 -N 3 -F bgfile -I 3,3,3 -E 0.01 -f infile -L (cer:0.8,par:0.8,mik:0.58,kud:0.5,bay:0.45). Here bgfile is a fasta file with all

We should point out that the performances of the different algorithms may vary as one varies parameter settings. We experimented with different parameter settings for each of the algorithms but none substantially changed the results shown in

The specificity-versus-sensitivity plots in the left panel of

For the right panel of _{p}_{r}_{p}_{r}_{p}

We used version 24 of the regulatory code from [

To compare the results of PhyloGibbs with those of [_{αi} by the total number of sites _{αi} = _{αi}

Let _{αi} and _{αi} being the number of times base α occurs at position _{n}_{m}_{αi} that base α occurs at position _{αi} = _{αi} when 1 ≤ _{αi} = _{αi} + _{α(i − k)} when (_{m},_{αi} = _{α(i − k)} when _{m}_{n}

where γ is the pseudocount of the prior over WM space and _{i}

where π is the prior probability. For each alignment _{n}_{m}_{n}_{m}

Finally, for each combination _{n}_{m}

In the

In addition, for each combination of a reported motif from PhyloGibbs and a TF with annotated sites in [

The results in _{r}_{t}_{r} ,m_{t} ,m_{r} ,m_{t} ,m_{r}_{t}_{r} ,m_{t} ,m_{r} ,m_{t} ,m_{r} ,m_{t} ,m_{r}_{r} ,m_{r}_{t}_{t}

For 11 TFs we gathered sets of target genes from the literature, collected their orthologs from the other sensu stricto species, obtained multiple alignments with Dialign, and ran PhyloGibbs on these sets of multiple alignments. The following command line options were used for all these runs: -D 1 -T 0.25 -a 300 -S 300 -L (cer:0.8,par:0.8,mik:0.58,kud:0.5,bay:0.45) -N 3 -F bgfile -f infile. A summary of the results of these runs, and the remaining parameter settings used, are shown in

Results of PhyloGibbs on Multiple Alignments of Upstream Regions Taken from the Literature

Detailed results, and the locations of all the binding sites newly identified in these runs, can be found in

This file lists all sites with posterior probability 0.05 or higher that PhyloGibbs predicted on the upstream regions of the genes that have one or more binding sites annotated in the SCPD [

(141 KB TXT)

This file summarizes the comparisons of the results of PhyloGibbs on the data from [

(99 KB TXT)

This file contains all binding sites with posterior probability at least 0.05 that PhyloGibbs predicted for the 45 TFs with between three and 25 sites annotated in [

(42 KB TXT)

For 11 TFs we gathered lists of genes that are known to be regulated by the TF from the literature. This file gives the list of ORF names of these genes for each of the 11 TFs. Example: DAL80 YKR034W YIR032C YDL210W YFL021W. This line shows that the TF DAL80 is reported in the literature to regulate the ORFs YKR034W, YIR032C, YDL210W, and YFL021W.

(10 KB TXT)

This file has the same format as

(9 KB TXT)

Results analogous to those shown in

(62 KB PDF)

Results analogous to those shown in

(110 KB PDF)

Results as in the left panel of

(60 KB PDF)

This table shows a comparison of the exact WM integrals with the monomial approximation that our algorithm employs.

(7 KB PDF)

Support was provided by the National Science Foundation, grant DMR-0129848. EvN received support from the Swiss National Science Foundation, project 3152A0–105972. Michael Mwangi programmed the script to reformat the Dialign output. Nicolas Buchler supplied 5′-UTR yeast sequences, based on the publicly available ones, from which coding shadows had been removed. RS thanks the Indian Lattice Gauge Theory Initiative for computer time on the “Kabru” cluster at the Institute of Mathematical Sciences. EvN thanks Saurabh Sinha for help running PhyME and for useful comments on the manuscript.

expectation maximization

open reading frame

Promoter Database of Saccharomyces cerevisiae

transcription factor

weight matrix