A Novel Bayesian DNA Motif Comparison Method for Clustering and Retrieval

doi:10.1371/journal.pcbi.1000010

Figure 1.

Overview of the challenges in DNA motif analysis.

(A) Identifying DNA binding motifs: Applying motif discovery algorithms to a group of related DNA sequences leads to the identification of putative transcription factor DNA binding sites. These algorithms output a set of DNA motifs, which are frequently redundant. To infer the correct transcription regulation map from the discovered motif set, it is crucial to reduce this redundancy and to relate the discovered motifs to known ones. (B) Reducing redundancy by clustering and merging motifs: A redundant set of DNA motifs can be reduced by clustering the motifs into groups of related ones and merging the motifs within each cluster. In this example, a redundant set of 16 DNA motifs (a partial output of several motif search algorithms) is clustered and merged to a final set consisting of three DNA motifs. (C) Relating motifs to known factors: The transcription factors that bind the newly discovered DNA motifs can be revealed based on similarities to previously defined motifs. In this example, comparison of a newly discovered motif to four known motifs reveals high similarity to the Gcn4 binding motif. From this comparison the transcription factor that binds the motif is identified with high probability.

Figure 2.

Problematic aspects of previous motif similarity scores.

(A) Distinguishing between informative and non-informative positions: Two pairs of aligned motifs are shown, each with three identical positions and two different ones. While the identical positions in the first pair (left) are non-informative, the identical positions in the second pair (right) are informative. The desired similarity score should distinguish between these two types of similarity and assign a higher score to the second pair. The nucleotide distributions are visualized so that the height of each nucleotide is proportional to its probability (see a real-life example in Figure S1). (B) Problematic aspects of motif similarity scores: The similarity score of two position frequency matrices (PFMs) decomposes into the sum of similarities of single aligned positions, due to the common position-independence assumption in the model. Here we present the similarity scores for various pairs of positions in DNA motifs according to several similarity functions, in addition to the desired score (scores are normalized to an arbitrary scale of −1 to 1). The nucleotide distribution in each position is visualized as in (A). As shown here, each of the scores (Pearson correlation, Jensen-Shannon divergence, and Euclidean distance) either fails to reflect the “true” similarity between two distributions or cannot distinguish between informative and uniform background positions. Specifically, position 1 should get a higher score than position 2, but the Pearson correlation scores for these positions are equal. Position 3 should get the lowest possible score, yet the Pearson correlation does not capture this. In both positions 1 and 4 identical distributions are compared, but the informative position 1 should get a higher score than position 4; all three methods fail to achieve this. Positions 4 and 5 both compare nearly uniform distributions. While in position 4 two identical distributions are compared, in position 5 there are small variations, which alter the order of nucleotides. As we show, Pearson correlation grades position 5 substantially lower than position 4.
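The per-position behavior described in (B) can be reproduced with a small sketch. The three columns below are invented for illustration and are not the values plotted in the figure; the function names are likewise hypothetical. The sketch computes raw (unnormalized) Pearson correlation, Jensen-Shannon divergence, and Euclidean distance for single columns over A, C, G, T.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two nucleotide distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint distribution."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pearson(p, q):
    """Pearson correlation across the four nucleotide probabilities."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var_p = sum((a - mp) ** 2 for a in p)
    var_q = sum((b - mq) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return float("nan")  # undefined for an exactly uniform column
    return cov / math.sqrt(var_p * var_q)

# Invented example columns (probabilities of A, C, G, T).
informative = [0.97, 0.01, 0.01, 0.01]
near_uniform = [0.26, 0.25, 0.25, 0.24]
near_uniform_reordered = [0.24, 0.25, 0.25, 0.26]

for name, p, q in [
    ("identical informative", informative, informative),
    ("identical near-uniform", near_uniform, near_uniform),
    ("near-uniform, reordered", near_uniform, near_uniform_reordered),
]:
    print(name, pearson(p, q), jensen_shannon(p, q), euclidean(p, q))
```

With these made-up columns, all three functions give the identical informative pair and the identical near-uniform pair the same value, so none of them rewards the informative match. The reordered near-uniform pair is almost indistinguishable by Euclidean distance, yet its Pearson correlation drops to about −1.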

Figure 3.

Evaluation of motif comparison scores.

(A) Generating the test data set: Given a set of genomic binding sites for a transcription factor, we generate motifs by randomly sampling subsets of genomic binding sites (5, 15, or 35 samples per motif), aligning them, and then truncating the resulting motif to include only a part of the motif. By repeating this procedure, slightly different sets of binding sites were built for each factor. This “Yeast” data set consisted of noisy motifs for nine different S. cerevisiae transcription factors using the genomic sequences obtained by Harbison et al. [13], with a total of 240 motifs for each factor. (B) Sensitivity and specificity of different scoring methods: Comparison of different scoring methods on the “Yeast” data set using a subset of motifs generated from subsets of size 35 with altered lengths (not including the full length motifs, 685 motifs). Each similarity score was assigned an empirical statistical significance p-value. The ROC curve plots the true positive rate (TPR) vs. the false positive rate (FPR), as computed for different p-value thresholds, where pairs of motifs generated from genomic binding sites that were associated with the same factor are considered true positives. The BLiC score (green, using a Dirichlet prior, or blue, using a Dirichlet-mixture prior) outperformed all other similarity scores: Jensen-Shannon (JS) divergence (red), Euclidean distance (purple), and Pearson correlation coefficient (cyan). The full set of comparisons is shown in Figure S2. (C) Sensitivity and specificity estimated by structural data: Same as (B), but using the “Structural” data set of Mahony et al. [24]. Pairs of motifs from the same structural family are considered true positives.
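The sampling procedure in (A) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the binding sites are already aligned and of equal length (so the alignment step is skipped), the toy site strings are invented, and `count_matrix` and `sample_noisy_motif` are hypothetical names.

```python
import random

ALPHABET = "ACGT"

def count_matrix(sites):
    """Per-position nucleotide counts for equal-length, pre-aligned sites."""
    return [{nt: sum(site[i] == nt for site in sites) for nt in ALPHABET}
            for i in range(len(sites[0]))]

def sample_noisy_motif(sites, n_samples, keep_len, rng):
    """Sample a subset of sites, build a count matrix, keep a random window."""
    subset = rng.sample(sites, n_samples)       # sampling without replacement
    counts = count_matrix(subset)
    start = rng.randrange(len(counts) - keep_len + 1)
    return counts[start:start + keep_len]       # truncated (partial) motif

# Invented toy sites; real input would be genomic binding sites for one factor.
toy_sites = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTAA",
             "TGACTCT", "AGACTCA", "TGACTCA", "TGATTCA"]

rng = random.Random(0)  # fixed seed so repeated runs give the same motifs
motifs = [sample_noisy_motif(toy_sites, n_samples=5, keep_len=5, rng=rng)
          for _ in range(3)]
```

Each call yields a slightly different count matrix for the same factor, which is the property the test data set relies on.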

Figure 4.

Evaluation of clustering DNA motifs.

(A) Motif clustering: In this example, the initial motif set consists of four motifs. The score assigned to each pair of motifs is the score of the best possible alignment between them (including the reverse complement form, as demonstrated in this example). In each step the highest scoring pair is merged into a new motif (by combining the evidence from both motifs). These steps are repeated until we are left with a single motif. The order of merge operations results in a tree, where the leaves are the initial motifs. Each frontier in this tree creates a set of motifs. A frontier in a tree is a subset of nodes, none of which is a descendant of another, such that every leaf in the tree is a descendant of one of them. In this example, a frontier resulting in two motifs is chosen: one is an initial motif and the other is a motif created by merging three initial motifs. These two motifs are the non-redundant set of motifs derived from the initial set. (B) Evaluation of clustering with different scoring methods: Motifs from the “Yeast” data set, generated from subsets of size 15 (180 motifs), were clustered. We split the resulting clustering tree using different thresholds. Each such threshold defines a different tradeoff between true positive rate (percent of correctly classified motifs in the clustering tree) and the number of clusters. In this graph we plotted the average of nine repeats of clustering sets of 180 motifs described above (total of 1620 different noisy motifs). This tradeoff curve demonstrates that our BLiC score (green, using a Dirichlet prior, and blue, using a Dirichlet-mixture prior) outperforms all other scoring methods: Pearson correlation, Euclidean distance, and Jensen-Shannon. A more detailed evaluation of clustering noisy motifs using various similarity scores is shown in Figure S3. (C) Clustering evaluated by structural data: Tradeoff curves (as in (B)) for clustering motifs in the “Structural” data set [24]. Pairs of motifs from the same structural family are considered true positives.
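The merge loop in (A) can be sketched as below, under strong simplifying assumptions that should be read as such: all motifs have the same length, only the two strand orientations are tried (no shifted alignments), and the pair score is a stand-in (negative summed squared difference of column frequencies), not the BLiC score. All function names and the toy sites are hypothetical.

```python
import itertools

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def from_sites(sites):
    """Count matrix (list of per-position dicts) from aligned sites."""
    return [{nt: sum(s[i] == nt for s in sites) for nt in ALPHABET}
            for i in range(len(sites[0]))]

def reverse_complement(motif):
    return [{COMPLEMENT[nt]: c for nt, c in col.items()}
            for col in reversed(motif)]

def freqs(col):
    total = sum(col.values())
    return [col[nt] / total for nt in ALPHABET]

def stand_in_score(m1, m2):
    """Stand-in similarity (NOT BLiC): higher means more similar."""
    return -sum((a - b) ** 2
                for c1, c2 in zip(m1, m2)
                for a, b in zip(freqs(c1), freqs(c2)))

def best_orientation(m1, m2):
    """Score m2 on both strands; return (score, m2 in the better orientation)."""
    rc = reverse_complement(m2)
    fwd, rev = stand_in_score(m1, m2), stand_in_score(m1, rc)
    return (fwd, m2) if fwd >= rev else (rev, rc)

def merge(m1, m2_oriented):
    """Combine the evidence of two motifs by adding their counts."""
    return [{nt: c1[nt] + c2[nt] for nt in ALPHABET}
            for c1, c2 in zip(m1, m2_oriented)]

def cluster(motifs):
    """Repeatedly merge the highest-scoring pair; return the merge steps."""
    nodes = dict(enumerate(motifs))  # node id -> count matrix
    next_id, steps = len(motifs), []
    while len(nodes) > 1:
        best = None
        for i, j in itertools.combinations(sorted(nodes), 2):
            score, oriented = best_orientation(nodes[i], nodes[j])
            if best is None or score > best[0]:
                best = (score, i, j, oriented)
        score, i, j, oriented = best
        merged = merge(nodes[i], oriented)
        del nodes[i], nodes[j]
        nodes[next_id] = merged
        steps.append((i, j, next_id, score))  # (child, child, parent, score)
        next_id += 1
    return steps

# Invented toy motifs: 0 and 1 are similar, 2 is the reverse complement
# of the same pattern, 3 is unrelated.
toy = [from_sites(["TGACTCA", "TGACTCA", "TGACTAA"]),
       from_sites(["TGACTCA", "TGACTCT"]),
       from_sites(["TGAGTCA", "TGAGTCA"]),
       from_sites(["CCGGAAT", "CCGGAAT"])]
steps = cluster(toy)
```

The recorded steps define the tree: the set of nodes still present after any number of merges is one frontier, so stopping the loop early yields a reduced motif set of the corresponding size.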

Figure 5.

Overview of the motif analysis pipeline.

The first step of the pipeline involves searching for motifs in each input set of DNA sequences, using complementary motif discovery algorithms. The motifs are filtered according to their abundance in the input set. In the second step the redundancy in the newly discovered set of motifs is reduced by clustering and merging similar motifs. These steps are performed separately for each set (top boxes). Then, the motifs found in each input set are clustered and merged to create a global non-redundant set of motifs. These motifs are then associated with known motifs from pre-existing libraries. The motifs in the refined set are ranked and filtered according to their abundance in each input set.

Figure 6.

Overview of the discovered motifs.

Investigation of the properties of discovered motifs. Each motif (column) is compared to other motifs using the BLiC score (rows, top square), to enrichment of putative targets among expressed or silenced genes within a compendium of gene expression at different cellular conditions (second group), to the enrichment of targets within various GO annotations (third group), and to enrichment in ChIP-chip location assays (bottom group). The rows and columns were clustered using EdgeCluster [33], an agglomerative clustering procedure that integrates various sources of information into the clustering process. Shown is the clustering for partial sets of motifs related to the transcription factors Fhl1, Sfp1, Rap1, Hsf1, Ste12, Mcm1, Swi4, Swi6, and Mbp1 (the full clustering is presented in Figure S4 and on http://compbio.cs.huji.ac.il/BLiC).

Figure 7.

Comparison to previous analysis methods.

Comparison of our discovered set of motifs to the ones learned by Harbison et al. [13] and MacIsaac et al. [34]. We plot the fraction of motifs that obtained the highest score among all three sets. We first compare transcription factors with previously characterized motifs by their similarity to the known motif from the literature [26]–[28], calculated using our BLiC score. For this comparison, we took for each transcription factor the motif most similar to the known binding site (as done in these two previous works). Our motifs received the highest similarity score (among all three studies) in 65% of the cases (right). The second comparison is for transcription factors with no characterized binding motif. This comparison is based on the enrichment of the motifs in the ChIP-chip data sets. For this comparison we took the most highly enriched motif for each factor and condition (for consistency with the two previous works). The same parameters were applied in the analysis of motifs from all three methods. In this setting, our motifs were found to have higher enrichment in 80% of the cases (left).

Figure 8.

Condition dependent behavior of Ste12.

(A) A Venn diagram representing the results of the ChIP-chip experiment [13] for Ste12 under mating (induced by alpha factor) and filamentous growth (induced by butanol). Ste12 alters its targets substantially between these two conditions. (B) Analysis of the percent of sequences bound by Ste12 that contain the different motifs (when searching for motif occurrences at a 2% false positive rate). Shown are the different motifs in the targets bound by Ste12 in the filamentous growth condition only (yellow), in the mating condition only (blue), or in both conditions (green). Each motif is shown as a sequence logo on the left and its percent occurrence in each group as a bar chart on the right. Under filamentous growth there is enrichment for a motif similar to the previously characterized Ste12 motif (top motif), as well as for the known recognition sequence of Tec1 (third from top). Under mating there is enrichment for a near-perfect tandem repeat of the known Ste12 binding site (second from top). A motif similar to the known Mcm1 motif (bottom motif) is found to be enriched under both conditions, especially under filamentous growth.

Figure 9.

Condition dependent behavior of Aft2.

(A) Venn diagrams representing the results of the ChIP-chip experiment [13] for the transcription factors Aft2 and Rcs1 under high and low H₂O₂ stress. Aft2 alters its targets substantially between these two conditions. (B) Analysis of the percent of sequences bound by Aft2 that contain the different motifs (when searching for motif occurrences at a 2% false positive rate). Shown are the different motifs in the targets bound by Aft2 and Rcs1 in low H₂O₂ stress only (yellow and blue, respectively), in high H₂O₂ stress only (red and green, respectively), or in both conditions (orange and cyan, respectively). Each motif is shown as a sequence logo on the left and its percent occurrence in each group as a bar chart on the right. Under low H₂O₂ stress there is enrichment for a motif similar to the previously characterized Aft2 motif (top motif), as well as for the known recognition sequence of Rcs1 (middle motif). Under high H₂O₂ stress only abundant low-complexity poly-GT repeats (bottom motif) have been identified.
