The Next Generation of Transcription Factor Binding Site Prediction

doi:10.1371/journal.pcbi.1003214

Figure 1.

HMM schemas.

(A) 1st-order HMM schema used in 1st-order TFFMs, where the first state represents the background and the following states represent the consecutive positions within a TFBS. Each state emits a nucleotide with a probability dependent on the nucleotide emitted previously. (B) HMM schema used in detailed TFFMs, where each state in the 1st-order HMM is decomposed into four states (one per nucleotide). Transition probabilities reflect the emission probabilities of the 1st-order HMM. This decomposition allows the start of a TFBS to depend on the nucleotide emitted by the background states.
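As a minimal sketch of the first-order dependency described in (A), the snippet below computes the log-probability of a short site when each nucleotide is conditioned on the one before it. It is an illustration only: the function name `site_log_prob`, the three-position model, and all probability values are hypothetical, and the background state and the HMM transition probabilities are deliberately left out.

```python
import math

NUCLEOTIDES = "ACGT"


def site_log_prob(site, start_probs, cond_probs):
    """Log-probability of a site under position-specific 1st-order dependencies.

    start_probs: dict nucleotide -> probability of the first nucleotide.
    cond_probs:  one dict per later position, mapping
                 previous nucleotide -> {current nucleotide -> probability}.
    """
    if len(site) != len(cond_probs) + 1:
        raise ValueError("site length does not match the model")
    log_p = math.log(start_probs[site[0]])
    for i in range(1, len(site)):
        # Emission at position i depends on the nucleotide seen at position i - 1.
        log_p += math.log(cond_probs[i - 1][site[i - 1]][site[i]])
    return log_p


# Hypothetical toy model (values invented for illustration, not taken from the paper).
uniform = {n: 0.25 for n in NUCLEOTIDES}
start = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
pos2 = {prev: dict(uniform) for prev in NUCLEOTIDES}
pos2["A"] = {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}  # "C" favoured after "A"
pos3 = {prev: dict(uniform) for prev in NUCLEOTIDES}
pos3["C"] = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}  # "G" favoured after "C"
model = [pos2, pos3]

log_p_acg = site_log_prob("ACG", start, model)
log_p_tcg = site_log_prob("TCG", start, model)
print(math.exp(log_p_acg), math.exp(log_p_tcg))
```

In this toy model "ACG" scores higher than "TCG" because the second position rewards a "C" only when it follows an "A", which a position-independent weight matrix cannot express.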

Figure 2.

Sequence logo representing a TFFM.

(A) Graphical representation of a TFFM constructed for the Hnf4A TF. Each column corresponds to a position within a TFBS. Each row captures the probabilities of each nucleotide appearing, depending on the nucleotide found at the previous position. The opacity of a cell represents the probability of reaching that cell, which depends on the probability of the corresponding nucleotide appearing at the previous position (the higher the opacity, the higher the probability). (B) The summary logo condenses the information in the dense logo in (A). (C) Zoom on the dense TFFM logo for positions 10 to 13 (corresponding to the box in (A)). We observe that a “C” is more likely to appear at position 12 if nucleotide “T” was found at position 11, whereas a “T” is more likely to appear at position 12 if nucleotide “G” was found at position 11.

Figure 3.

Performance comparison between TFFMs and weight matrices.

For the 96 ChIP-seq data sets obtaining an % for at least one method (using a genomic background), the ratio between the AUC value obtained with a specific model and the best AUC obtained is plotted. Four types of models were used (1st-order TFFM, detailed TFFM, PWM, and DWM). Considering two methods to perform similarly when the AUC ratio is %, we plot at the top of the figure the region where the weight matrices (WMs) perform best, where the TFFMs perform best, and where they are similar. AUC ratios are ranked from the least to the most favourable to the TFFMs.
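The AUC ratio plotted here can be sketched as follows for a single data set: each model's AUC is divided by the best AUC among the models. The helper name `auc_ratios` and the AUC values are hypothetical, and no similarity threshold is applied in this sketch.

```python
def auc_ratios(aucs):
    """Divide each model's AUC by the best AUC obtained on the same data set."""
    best = max(aucs.values())
    return {model: value / best for model, value in aucs.items()}


# Hypothetical AUC values for one data set (invented for illustration).
example_aucs = {
    "1st-order TFFM": 0.90,
    "detailed TFFM": 0.92,
    "PWM": 0.85,
    "DWM": 0.88,
}
ratios = auc_ratios(example_aucs)
for model, ratio in ratios.items():
    print(f"{model}: {ratio:.3f}")
```

The best model on a data set always has a ratio of 1, so the ratios are comparable across data sets with different absolute AUC levels.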

Table 1.

Statistical significance for discriminative power differences between the predictive methods.

Figure 4.

Performance comparison between 0-order TFFMs, other TFFMs, and weight matrices.

For the 96 ChIP-seq data sets used in Figure 3 (using genomic background), the ratio between the AUC value using a specific model and the best AUC obtained is plotted. (A) The three types of TFFMs were used (1st-order, detailed, and 0-order TFFMs). AUC ratios are ranked from the least to the most favourable to the 1st-order and detailed TFFMs. We observe that the 1st-order and detailed TFFMs outperform the 0-order TFFMs when discriminating ChIP-seq sequences from genomic background sequences. (B) 0-order TFFMs and WMs were used. AUC ratios are ranked from the least to the most favourable to the 0-order TFFM. We observe that the WMs outperform the 0-order TFFMs when discriminating ChIP-seq sequences from genomic background sequences.

Figure 5.

Correlations between prediction scores and ChIP-seq peak scores or binding affinities.

(A) ChIP-seq signal values obtained from ENCODE data sets were compared to prediction values obtained with the four different predictive methods. The distribution of Spearman's correlation values from all data sets is given for 1st-order TFFMs, detailed TFFMs, PWMs, and DWMs. An over-representation of Spearman's correlations around 1 (perfect correlation) is found for the four methods. (B) Pearson correlation between scores obtained using the different predictive methods and DNA-binding affinities from [52].
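For readers unfamiliar with the two statistics, the sketch below computes Spearman's correlation as the Pearson correlation of average ranks, using only the standard library. The function names and the peak signal and score values are hypothetical and are not taken from the ENCODE data sets.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical peak signal values and prediction scores (invented for illustration).
peak_signals = [12.0, 30.5, 7.2, 55.1, 21.3]
prediction_scores = [0.41, 0.50, 0.35, 0.97, 0.62]
rho = spearman(peak_signals, prediction_scores)
print(rho)
```

Because Spearman's correlation only uses ranks, it rewards a model whose scores order the peaks correctly even when the relationship between score and signal is not linear.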

Table 2.

Pearson correlation coefficients between experimentally measured and predicted changes in affinity correlations.

Figure 6.

ROC curve analysis of JunD ChIP-seq data.

TFFMs allowing a flexible-length motif have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs. Flexible TFFMs outperform the other models: their ROC curve lies above the ROC curves of the other models.
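A ROC curve of this kind can be sketched by sweeping a score threshold over foreground (ChIP-seq) and background sequences and recording the false and true positive rates at each step. The code below is a generic illustration with hypothetical function names, scores, and labels; it is not the evaluation pipeline used for the figure.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR); labels are 1 for foreground and 0 for background."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one foreground and one background sequence")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all sequences sharing this score are counted.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical best-hit scores per sequence (invented for illustration).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(curve, auc)
```

A model whose curve lies above another's at every false positive rate necessarily has the larger area, which is why the AUC is used as a one-number summary of the curves.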

Figure 7.

ROC curve analysis of STAT4 and STAT6 ChIP-seq data.

TFFMs allowing a flexible-length motif have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs on STAT4 (A) and STAT6 (B) ChIP-seq data. Flexible TFFMs do not perform significantly better than fixed-length TFFMs. DWMs, PWMs, and GLAM2 show lower discriminative power than the TFFMs.

Figure 8.

ROC curve analysis of MafK ChIP-seq data.

TFFMs allowing a motif with a flexible edge have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs. Flexible TFFMs perform slightly better than fixed-length TFFMs and both outperform the other models.
