A Novel Bayesian DNA Motif Comparison Method for Clustering and Retrieval

doi:10.1371/journal.pcbi.1000010

Figure 3

Evaluation of motif comparison scores.

(A) Generating the test data set: Given a set of genomic binding sites for a transcription factor, we generate motifs by randomly sampling subsets of the binding sites (5, 15, or 35 sites per motif), aligning them, and then truncating the resulting motif so that only part of it is retained. Repeating this procedure yields slightly different sets of binding sites for each factor. This “Yeast” data set consists of noisy motifs for nine different S. cerevisiae transcription factors, built from the genomic sequences obtained by Harbison et al. [13], with a total of 240 motifs for each factor. (B) Sensitivity and specificity of different scoring methods: Comparison of scoring methods on the “Yeast” data set, using the motifs generated from subsets of size 35 with altered lengths (685 motifs, not including the full-length motifs). Each similarity score was assigned an empirical p-value. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), computed at different p-value thresholds, where pairs of motifs generated from genomic binding sites of the same factor are considered true positives. The BLiC score (green, using a Dirichlet prior; blue, using a Dirichlet-mixture prior) outperformed all other similarity scores: Jensen-Shannon (JS) divergence (red), Euclidean distance (purple), and Pearson correlation coefficient (cyan). The full set of comparisons is shown in Figure S2. (C) Sensitivity and specificity estimated by structural data: Same as (B), but using the “Structural” data set of Mahony et al. [24]. Pairs of motifs from the same structural family are considered true positives.
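The procedure in panels (A) and (B) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `sample_truncated_motif` and `roc_points` are hypothetical, the binding sites are assumed to be already aligned and of equal length (the caption's alignment step is skipped), and the p-values are taken as given rather than computed from any particular similarity score.

```python
import random

ALPHABET = "ACGT"

def sample_truncated_motif(sites, n_samples, start, length, rng):
    """Build a count matrix from a random subset of sites, keeping one window.

    Assumption: `sites` are pre-aligned strings of equal length, so the
    alignment step described in the caption is not performed here.
    """
    subset = rng.sample(sites, n_samples)  # sample without replacement
    counts = [{base: 0 for base in ALPHABET} for _ in range(length)]
    for site in subset:
        window = site[start:start + length]  # truncate to part of the motif
        for pos, base in enumerate(window):
            counts[pos][base] += 1
    return counts

def roc_points(pvalues, is_same_factor, thresholds):
    """Return one (FPR, TPR) pair per p-value threshold.

    A motif pair is called similar when its p-value is <= the threshold;
    pairs derived from the same factor are the true positives.
    """
    n_pos = sum(is_same_factor)
    n_neg = len(is_same_factor) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for p, same in zip(pvalues, is_same_factor) if p <= t and same)
        fp = sum(1 for p, same in zip(pvalues, is_same_factor) if p <= t and not same)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy data, invented purely for illustration.
rng = random.Random(0)
sites = ["ACGTAC", "ACGTTC", "ACGAAC", "TCGTAC", "ACGTAG", "ACCTAC"]
motif = sample_truncated_motif(sites, n_samples=5, start=1, length=4, rng=rng)

pvalues = [0.001, 0.2, 0.03, 0.6]
is_same_factor = [True, False, True, False]
pts = roc_points(pvalues, is_same_factor, thresholds=[0.01, 0.05, 0.5, 1.0])
```

Each column of `motif` sums to the number of sampled sites, and sweeping the threshold from strict to permissive traces the ROC curve from the lower left toward (1, 1).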

doi: https://doi.org/10.1371/journal.pcbi.1000010.g003