Figure 1.
(A) Vertices of the BSG represent nucleotides in input promoters. Here, the input nucleotides are represented as points along the perimeter of a circle. Each disconnected bar along the perimeter of the circle represents a single promoter, with the transcription start site indicated.
(B) Edges (arcs across the circle) are added between each pair of aligned nucleotides in a motif resulting from a single Gibbs sampling prediction.
(C) Edges are compiled across ensemble Gibbs sampling results. Recurring edges are weighted by the number of times they recur, as indicated by different colored arcs in the BSG. Once all edges are collected from ensemble Gibbs sampling results, edge weights are normalized by the weight of the most frequently recurring edge.
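The construction in panels A to C can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the input layout (each Gibbs sampling run given as a list of equal-width sites) and the names `site` and `build_bsg` are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def site(promoter, start, width):
    """Hypothetical helper: the vertices (promoter, position) covered by one site."""
    return [(promoter, start + i) for i in range(width)]


def build_bsg(runs):
    """Return {(u, v): normalized weight} from an ensemble of motif predictions.

    runs: list of Gibbs sampling predictions; each prediction is a list of
    sites of equal width, so column i of every site is mutually aligned.
    """
    counts = Counter()
    for sites in runs:
        for column in zip(*sites):  # nucleotides aligned at one motif position
            # One edge per pair of aligned nucleotides, counted once per run.
            for u, v in combinations(sorted(set(column)), 2):
                counts[(u, v)] += 1
    if not counts:
        return {}
    top = max(counts.values())  # count of the most frequently recurring edge
    return {edge: c / top for edge, c in counts.items()}


runs = [
    [site("P1", 10, 3), site("P2", 40, 3)],
    [site("P1", 10, 3), site("P2", 40, 3)],
    [site("P1", 11, 3), site("P3", 5, 3)],
]
bsg = build_bsg(runs)
for edge, weight in sorted(bsg.items()):
    print(edge, weight)
```

Edges seen in both of the first two runs receive weight 1 after normalization, and edges seen only in the third run receive weight 0.5.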
Figure 2.
An Illustration of the Weighted Clustering Coefficient: Behavior of the Weighted Clustering Coefficient of Vertex n Is Shown across Four Increasingly Sparse Graphs
(A) In the most dense case, each pair of vertices adjacent to n is also adjacent, and all edge weights are maximal. Thus, n participates in a maximally weighted clique, and the weighted clustering coefficient is 1.
(B) As edges are removed from the clique, the neighborhood of n becomes less well-connected, and the clustering coefficient decreases.
(C) Unlike the original definition of clustering coefficient, the weighted clustering coefficient responds to the decreased intensity of the cluster resulting from intermediately weighted edges.
(D) Finally, when no edges exist between neighbors of n, the clustering coefficient goes to 0.
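The behavior in panels A to D can be reproduced with one common weighted generalization of the clustering coefficient, in which each pair of neighbors of n contributes the geometric mean of the three edge weights of the triangle it forms with n. The exact formula used for the BSG is not stated in this legend, so this choice and the names `E` and `weighted_clustering` are assumptions made for illustration. Weights are taken to lie in [0, 1].

```python
from itertools import combinations


def E(u, v):
    """Hypothetical helper: an undirected edge key."""
    return frozenset((u, v))


def weighted_clustering(weights, n):
    """Weighted clustering coefficient of vertex n (assumed geometric-mean form).

    weights: dict mapping E(u, v) -> weight in [0, 1]; absent edges count as 0.
    """
    neighbors = sorted({v for edge in weights if n in edge for v in edge if v != n})
    k = len(neighbors)
    if k < 2:
        return 0.0
    total = 0.0
    for j, h in combinations(neighbors, 2):
        w_jh = weights.get(E(j, h), 0.0)  # 0 if the two neighbors are not adjacent
        total += (weights[E(n, j)] * weights[E(n, h)] * w_jh) ** (1.0 / 3.0)
    return total / (k * (k - 1) / 2)  # average over all neighbor pairs


spokes = {E("n", "a"): 1.0, E("n", "b"): 1.0, E("n", "c"): 1.0}
clique = {**spokes, E("a", "b"): 1.0, E("a", "c"): 1.0, E("b", "c"): 1.0}
sparser = {k: w for k, w in clique.items() if k != E("a", "b")}
weaker = {**spokes, E("a", "b"): 0.5, E("a", "c"): 0.5, E("b", "c"): 0.5}

for name, g in [("A", clique), ("B", sparser), ("C", weaker), ("D", spokes)]:
    print(name, round(weighted_clustering(g, "n"), 3))
```

Under this assumed definition, the maximally weighted clique gives 1, removing an edge or lowering edge weights gives an intermediate value, and a neighborhood with no internal edges gives 0.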
Figure 3.
Predicting TFBS within the BSG Framework
(A) First, a BSG is constructed from ensemble Gibbs sampling. Here, the perimeter of the circle represents promoters, and lines between nucleotides in the promoters correspond to edges in the BSG. Edges are heat-mapped according to edge weight.
(B) The filtered BSG is obtained by selecting an edge-weight threshold ρ that maximizes the BSGscore and then removing all edges with weight less than ρ from the graph.
(C) Final TFBS predictions are made from the filtered BSG by collecting nucleotides contiguous in the original promoters into prediction sequences, which are returned in FASTA format. The promoter region depicted contains two predicted TFBS.
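Panels B and C can be illustrated with a short Python sketch. It takes the threshold ρ as given, because the BSGscore that selects it is not defined in this legend. The function names, the 0-based positions, and the header layout of the output records are assumptions made for the example.

```python
def filter_edges(bsg, rho):
    """Drop every edge whose weight is less than the threshold rho."""
    return {edge: w for edge, w in bsg.items() if w >= rho}


def contiguous_sites(edges):
    """Group surviving vertices into runs of adjacent positions per promoter.

    edges: iterable of (u, v) with u, v = (promoter, position).
    Returns a list of (promoter, start, end) with inclusive bounds.
    """
    positions = {}
    for u, v in edges:
        for promoter, pos in (u, v):
            positions.setdefault(promoter, set()).add(pos)
    sites = []
    for promoter in sorted(positions):
        run = []
        for pos in sorted(positions[promoter]):
            if run and pos != run[-1] + 1:  # gap: close the current site
                sites.append((promoter, run[0], run[-1]))
                run = []
            run.append(pos)
        sites.append((promoter, run[0], run[-1]))
    return sites


def to_fasta(sites, sequences):
    """Render sites as FASTA records; the header layout is hypothetical."""
    lines = []
    for promoter, start, end in sites:
        lines.append(f">{promoter}:{start}-{end}")
        lines.append(sequences[promoter][start:end + 1])
    return "\n".join(lines) + "\n"


bsg = {
    (("P1", 3), ("P2", 7)): 1.0,
    (("P1", 4), ("P2", 8)): 0.9,
    (("P1", 5), ("P2", 9)): 0.8,
    (("P1", 12), ("P2", 15)): 0.1,   # weak edge, removed by the filter
    (("P1", 20), ("P2", 30)): 0.7,
    (("P1", 21), ("P2", 31)): 0.95,
}
sequences = {"P1": "ACGT" * 10, "P2": "ACGT" * 10}  # toy promoters
sites = contiguous_sites(filter_edges(bsg, rho=0.5))
print(to_fasta(sites, sequences), end="")
```

In this toy input each promoter ends up with two predicted sites, mirroring the two TFBS shown in the depicted promoter region.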
Figure 4.
Frequency (y-Axis) of BSGscores (x-Axis) for Sets of 7–30 Randomly Selected Yeast Promoters (Blue Bars) Compared with the Probability Distribution of the Estimated Generalized Extreme Value Distribution (Orange Line)
Empirical and estimated cumulative probability distributions (black triangles and orange line, respectively) are shown in the inset. A Kolmogorov–Smirnov (KS) test finds no significant difference between the empirical and estimated distributions (p = 0.997).
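The comparison in the inset rests on two standard ingredients: the cumulative distribution function of the generalized extreme value (GEV) distribution and the KS statistic, which is the largest gap between the empirical and the fitted cumulative distributions. The sketch below shows both in plain Python. The parameter values are hypothetical, and neither the fitting of the GEV parameters nor the computation of the p-value is shown.

```python
import math


def gev_cdf(x, mu, sigma, xi):
    """CDF of the GEV distribution with location mu, scale sigma, shape xi."""
    z = (x - mu) / sigma
    if xi == 0.0:  # Gumbel limit
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:  # x lies outside the support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF of sample and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical parameters and toy scores, for illustration only.
mu, sigma, xi = 2.0, 0.5, 0.1
scores = [1.6, 1.9, 2.0, 2.1, 2.3, 2.4, 2.8, 3.1, 3.5]
print(ks_statistic(scores, lambda x: gev_cdf(x, mu, sigma, xi)))
```

A small statistic corresponds to a large p-value, that is, to an empirical distribution that the fitted GEV describes well.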
Table 1.
Summary of Genome-Wide Predictions of TF Specificities
Figure 5.
Differences between BSG Predictions and Gibbs Sampling
The motif predicted by BSGs is compared with the best-scoring motif from an equivalent amount of Gibbs sampling. In some cases, such as HSF1 and LEU3, BSGs perform better through better estimation of the width of the motif. In such cases, manually choosing the correct motif width based on a priori knowledge allows Gibbs sampling to predict the correct motif. In other cases, however, such as SIP4 and RDS1, choosing the best Gibbs sampling width does not produce the correct prediction. For RDS1, N/A indicates that the motif width reported previously [29] matches the width of the best Gibbs sampling motif, and thus manually selecting the motif width does not alter the Gibbs sampling prediction.
Figure 6.
Benchmarked Evaluation of BSG Binding Site Predictions for Yeast Datasets from Tompa et al. [13]
Performance of BSG predictions is compared with that of the three best-performing algorithms according to a previously published evaluation. Performance measures (x-axis) are nSn (nucleotide sensitivity), nPPV (nucleotide positive predictive value), nPC (nucleotide performance coefficient), nCC (nucleotide correlation coefficient), sSn (site sensitivity), sPPV (site positive predictive value), and sASP (average site performance). For formulas used to calculate these measures, see Materials and Methods. BSGs significantly outperform all previously evaluated algorithms in nearly every measure. Most notable are improvements in nucleotide and site positive predictive value, where predictions from BSGs achieve values of 0.71 and 0.77, respectively.
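For orientation, the measures listed above can be computed from counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below follows the definitions used in the benchmark of Tompa et al. [13] and assumes they match those in Materials and Methods, which is not reproduced here. Returning 0 when a denominator is zero is a simplification made for the example, and the counts are invented.

```python
import math


def ratio(a, b):
    """a / b, or 0.0 when the denominator is zero (simplifying assumption)."""
    return a / b if b else 0.0


def nucleotide_metrics(tp, fp, fn, tn):
    """Nucleotide-level measures from per-nucleotide confusion counts."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "nSn": ratio(tp, tp + fn),        # sensitivity
        "nPPV": ratio(tp, tp + fp),       # positive predictive value
        "nPC": ratio(tp, tp + fn + fp),   # performance coefficient
        "nCC": ratio(tp * tn - fn * fp, denom),  # correlation coefficient
    }


def site_metrics(tp, fp, fn):
    """Site-level measures from per-site counts."""
    ssn = ratio(tp, tp + fn)
    sppv = ratio(tp, tp + fp)
    return {"sSn": ssn, "sPPV": sppv, "sASP": (ssn + sppv) / 2}


# Invented counts, for illustration only.
print(nucleotide_metrics(tp=8, fp=2, fn=2, tn=88))
print(site_metrics(tp=3, fp=1, fn=1))
```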
Figure 7.
Comparison of Robustness to Noisy Decoy Promoters between BSG Predictions, Positional Clustering [25], and an Equivalent Number of Gibbs Sampling Runs (6,656 Gibbs Sampling Predictions)
For each of the signal sets, varying numbers of random S. cerevisiae promoters were added to the original ChIP–chip derived set (x-axis). TFBS predictions were made using BSGs (circles), positional clustering (triangles), and the best predictions from an equivalent number of iterations of Gibbs sampling alone (squares). For each set of predictions, the PPV (y-axis) was calculated by comparing the prediction with published motifs as described in Materials and Methods. For BSG predictions, filled, half-filled, and open circles represent p < 0.01, p < 0.1, and p > 0.1, respectively. BSGs attain dramatically higher PPV than Gibbs sampling alone, especially in the noisiest input sets. In some cases, the PPV does not decrease monotonically with the addition of noise. This effect is the result of spurious instances of the binding site occurring in the decoy promoters. Although STE12 predictions are not significant, the well-known motif is almost always discovered. In all STE12 predictions, multiple components were identified in the BSG, highlighting the need to generalize the p-value to graphs with multiple motifs.