A Novel Bayesian DNA Motif Comparison Method for Clustering and Retrieval

doi:10.1371/journal.pcbi.1000010

Figure 4

Evaluation of clustering DNA motifs.

(A) Motif clustering: In this example, the initial motif set consists of four motifs. The score assigned to each pair of motifs is the score of the best possible alignment between them (including alignments to the reverse complement, as demonstrated in this example). In each step, the highest-scoring pair is merged into a new motif by combining the evidence from both motifs. These steps are repeated until a single motif remains. The order of the merge operations defines a tree whose leaves are the initial motifs. Each frontier in this tree defines a set of motifs. A frontier is a subset of nodes, none of which is a descendant of another, such that every leaf in the tree either belongs to the subset or descends from one of its nodes. In this example, a frontier yielding two motifs is chosen: one is an initial motif, and the other was created by merging three initial motifs. These two motifs form the non-redundant set derived from the initial set.

(B) Evaluation of clustering with different scoring methods: Motifs from the “Yeast” data set, generated from subsets of size 15 (180 motifs), were clustered. We split the resulting clustering tree using different thresholds. Each threshold defines a different tradeoff between the true positive rate (the percentage of correctly classified motifs in the clustering tree) and the number of clusters. The graph plots the average over nine repeats of clustering the sets of 180 motifs described above (1620 different noisy motifs in total). The tradeoff curve shows that our BLiC score (green, using a Dirichlet prior; blue, using a Dirichlet-mixture prior) outperforms the other scoring methods: Pearson correlation, Euclidean distance, and Jensen-Shannon divergence. A more detailed evaluation of clustering noisy motifs using various similarity scores is shown in Figure S3.

(C) Clustering evaluated by structural data: Tradeoff curves (as in (B)) for clustering motifs in the “Structural” data set [24]. Pairs of motifs from the same structural family are considered true positives.
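
The procedure in panel (A) and the threshold-based tree split in panel (B) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Motifs are assumed to be count matrices with one row per position and columns ordered A, C, G, T. The column score is a simple Euclidean stand-in rather than BLiC. The helper names (`best_alignment`, `merge`, `cluster`, `frontier`, `one_hot`), the `min_overlap` parameter, and the toy motifs are all hypothetical.

```python
import numpy as np

def revcomp(m):
    # Rows are positions, columns are A, C, G, T: reversing both axes
    # yields the reverse complement of a count matrix.
    return m[::-1, ::-1]

def freqs(m):
    # Column frequencies with a pseudocount of 1 (an assumption of this sketch).
    return (m + 1.0) / (m + 1.0).sum(axis=1, keepdims=True)

def best_alignment(a, b, min_overlap=3):
    # Try every offset of b (and of its reverse complement) against a.
    # Stand-in score, NOT BLiC: sum over overlapping columns of
    # 1 minus the Euclidean distance between column frequencies.
    best = (-np.inf, 0, False)
    for flipped in (False, True):
        bb = revcomp(b) if flipped else b
        for off in range(min_overlap - len(bb), len(a) - min_overlap + 1):
            lo, hi = max(0, off), min(len(a), off + len(bb))
            d = np.linalg.norm(freqs(a[lo:hi]) - freqs(bb[lo - off:hi - off]), axis=1)
            score = float(np.sum(1.0 - d))
            if score > best[0]:
                best = (score, off, flipped)
    return best

def merge(a, b, off, flipped):
    # Combine the evidence by adding counts over the union of aligned columns.
    bb = revcomp(b) if flipped else b
    start, end = min(0, off), max(len(a), off + len(bb))
    out = np.zeros((end - start, 4))
    out[-start:-start + len(a)] += a
    out[off - start:off - start + len(bb)] += bb
    return out

def cluster(motifs, min_overlap=3):
    # Greedy agglomeration: repeatedly merge the highest-scoring pair.
    # Returns the merge history as (child, child, new_node, score) tuples.
    nodes = {i: np.asarray(m, dtype=float) for i, m in enumerate(motifs)}
    merges, next_id = [], len(motifs)
    while len(nodes) > 1:
        ids = sorted(nodes)
        (score, off, flipped), i, j = max(
            ((best_alignment(nodes[i], nodes[j], min_overlap), i, j)
             for k, i in enumerate(ids) for j in ids[k + 1:]),
            key=lambda t: t[0][0])
        nodes[next_id] = merge(nodes.pop(i), nodes.pop(j), off, flipped)
        merges.append((i, j, next_id, score))
        next_id += 1
    return merges

def frontier(merges, n_leaves, threshold):
    # Walk down from the root; keep a node if it is a leaf or if its merge
    # score reaches the threshold, otherwise descend into its two children.
    children = {new: (i, j, s) for i, j, new, s in merges}
    out, stack = [], [merges[-1][2] if merges else 0]
    while stack:
        node = stack.pop()
        if node < n_leaves or children[node][2] >= threshold:
            out.append(node)
        else:
            stack.extend(children[node][:2])
    return sorted(out)

def one_hot(seq, n):
    # Toy motif: n counts on the consensus base at each position.
    m = np.zeros((len(seq), 4))
    m[np.arange(len(seq)), ["ACGT".index(c) for c in seq]] = n
    return m

# Four toy motifs: three variants of one site (one given as its reverse
# complement) and one unrelated motif.
motifs = [one_hot("ACGTTC", 10), one_hot("GAACGT", 10),
          one_hot("ACGTTC", 5), one_hot("TTTTTT", 10)]
merges = cluster(motifs)
print(frontier(merges, len(motifs), threshold=3.0))
```

Lowering the threshold moves the frontier toward the root (fewer, larger clusters), and raising it moves the frontier toward the leaves, which is the tradeoff swept in panel (B). Because greedy merge scores need not decrease monotonically, the top-down walk is used here so that the selected nodes always form a valid frontier.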

doi: https://doi.org/10.1371/journal.pcbi.1000010.g004