Fig 1.
MotifSpec optimizes for specificity rather than over-representation and uses a dynamic search space.
(A) An over-represented motif is found in the search space more often than expected according to some background model. It is not necessarily predictive. A specific motif is found at a much higher frequency in the search space than in the background sequences. A dynamic search space threshold finds the optimal search space such that the motif is most discriminative. (B) A schematic of the MotifSpec algorithm. The PWM model is initialized with a random sequence and position in the search space. The model is iteratively refined, and the motif and binding score thresholds are adjusted at convergence to maximize specificity. (C) An example of sequences scored using the model. Each sequence has a motif score and a binding score. The binding score determines whether a sequence is in the search space. The motif score determines whether the sequence has an instance of the motif. The sequences are color-coded according to the set to which they belong, as defined in (B).
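The two-threshold classification described in (C) can be illustrated with a minimal sketch. The class `ScoredSeq`, the function names, the set labels, and the example scores are all hypothetical, and the "fraction of motif instances inside the search space" computed here is only a simple stand-in for a specificity measure; the caption does not state the exact objective that MotifSpec optimizes.

```python
from dataclasses import dataclass


@dataclass
class ScoredSeq:
    name: str
    motif_score: float    # how well the best site matches the motif model
    binding_score: float  # e.g. a ChIP signal or expression-derived score


def partition(seqs, motif_thr, binding_thr):
    """Split sequences into four sets using the two thresholds."""
    sets = {
        "in_space_with_motif": [],
        "in_space_no_motif": [],
        "background_with_motif": [],
        "background_no_motif": [],
    }
    for s in seqs:
        space = "in_space" if s.binding_score >= binding_thr else "background"
        motif = "with_motif" if s.motif_score >= motif_thr else "no_motif"
        sets[f"{space}_{motif}"].append(s.name)
    return sets


def fraction_in_search_space(sets):
    """Illustrative stand-in: share of motif-containing sequences that are in the search space."""
    hits = len(sets["in_space_with_motif"])
    total = hits + len(sets["background_with_motif"])
    return hits / total if total else 0.0


# Made-up scores, for illustration only.
example = [
    ScoredSeq("s1", 9.0, 0.90),
    ScoredSeq("s2", 8.5, 0.80),
    ScoredSeq("s3", 2.0, 0.85),
    ScoredSeq("s4", 8.0, 0.10),
    ScoredSeq("s5", 1.0, 0.20),
]
example_sets = partition(example, motif_thr=5.0, binding_thr=0.5)
print(example_sets)
print(fraction_in_search_space(example_sets))
```

Raising or lowering `binding_thr` changes which sequences count as the search space, which is the sense in which the search space is dynamic.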
Fig 2.
MotifSpec performs comparably to HOMER and Dimont and consistently better than DECOD, DREME, and Amadeus in finding a discriminative motif when run on ChIP-seq data for three human transcription factors, CTCF, NRSF and the estrogen receptor (ER). Panels a, b, and c show the ROC curves and auROC values for the top scoring motif from each program when run on the three datasets. Panel d shows a summary comparison of auROC for each algorithm and motif, and panel e shows the top scoring motif found by each program.
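For readers unfamiliar with auROC, the following minimal sketch computes it from motif scores of bound (positive) and unbound (negative) sequences, using the equivalence between the area under the ROC curve and the probability that a randomly chosen positive outscores a randomly chosen negative. The function name and the example scores are hypothetical, and this is not necessarily how the values in the figure were computed.

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up motif scores, for illustration only.
bound = [0.9, 0.8, 0.4]
unbound = [0.7, 0.3, 0.2]
print(auroc(bound, unbound))
```

An auROC of 0.5 corresponds to a motif that ranks bound and unbound sequences no better than chance, and 1.0 to perfect separation.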
Fig 3.
Mouse ChIP-seq results: MotifSpec outperforms DREME when run on ChIP-seq data for 13 transcription factors from mouse embryonic stem cells.
The left panel shows a plot of the AUC for the top motif reported by MotifSpec against the AUC for the top motif reported by DREME, while the right panel shows the improvement in AUC for the MotifSpec motif relative to the DREME motif.
Fig 4.
Binding specificities for four C. elegans transcription factors as learnt from ChIP-seq data from the modENCODE project.
Fig 5.
Motifs found by MotifSpec perform better at retrieval of bound probes than the motifs found by Seed-and-Wobble.
The bar chart shows the percentage improvement in the area under the receiver operating characteristic (ROC) curve (AUC). The top motif found by MotifSpec performs better than the Seed-and-Wobble motif in the majority of cases where either motif has an AUC of 0.75 or better. Three representative ROC curves are shown: two (Gln3 and Pbf2-9) in which MotifSpec outperforms Seed-and-Wobble and one (Sum1-9) in which Seed-and-Wobble is better. The red curve is the ROC for the Seed-and-Wobble motif and the green curve is the ROC for the best MotifSpec motif.
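The comparison summarized in the bar chart can be sketched as follows. The sketch assumes that percentage improvement is measured relative to the Seed-and-Wobble AUC, which the caption does not state explicitly; the function name, factor names, and AUC values are hypothetical.

```python
def percent_improvement(auc_pairs, min_auc=0.75):
    """Return {factor: % change of new AUC over reference AUC}.

    auc_pairs maps a factor name to (new_auc, reference_auc). A factor is kept
    only if at least one of its two AUCs reaches min_auc.
    Assumption: improvement is relative to the reference AUC.
    """
    result = {}
    for factor, (new_auc, ref_auc) in auc_pairs.items():
        if max(new_auc, ref_auc) >= min_auc:
            result[factor] = 100.0 * (new_auc - ref_auc) / ref_auc
    return result


# Made-up AUC values, for illustration only.
example_aucs = {
    "tf_a": (0.90, 0.80),  # new motif better
    "tf_b": (0.70, 0.60),  # both below 0.75, so excluded
    "tf_c": (0.76, 0.80),  # reference motif better
}
print(percent_improvement(example_aucs))
```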
Fig 6.
MotifSpec performs better at recovery of seeded motifs from a synthetic sequence-expression dataset than two-step procedures of k-means clustering and motif-finding using AlignACE, MEME and Weeder.
Fig 7.
MotifSpec detects more known yeast motifs than the combination of k-means clustering and AlignACE (km-aa).
There were 97 known motifs in total. A CompareACE motif similarity score of 0.75 or greater was considered a match. ChIP target sets were considered a match if the hypergeometric p-value for overlap was less than 10⁻⁷.
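The hypergeometric overlap test mentioned above can be sketched with the standard library alone. The function and parameter names are hypothetical, and the gene counts in the example are made up; the choice of background population is an assumption that the caption does not specify.

```python
from math import comb


def hypergeom_overlap_pvalue(population, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn without replacement from
    a population that contains size_a genes of interest."""
    upper = min(size_a, size_b)
    total = comb(population, size_b)
    tail = sum(
        comb(size_a, i) * comb(population - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 10 genes in total, 4 ChIP targets, 3 predicted targets, 2 shared.
p = hypergeom_overlap_pvalue(population=10, size_a=4, size_b=3, overlap=2)
print(p)
print(p < 1e-7)  # match criterion from the caption
```

Exact integer arithmetic with `comb` avoids the underflow that can occur when multiplying many small probabilities, at the cost of speed for very large gene sets.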
Fig 8.
The top 5 motifs found by MotifSpec in a genome-wide search of a C. elegans sequence and expression dataset.
Alongside each motif is its specificity score and any Gene Ontology (GO) and Anatomy Ontology (AO) terms that were enriched in the list of target genes.