SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory Element Maps

Fig 2

SeqGL performs significantly better than traditional motif discovery methods across different settings.

(A) Plot showing the k-mer weights inferred by group lasso regularized logistic regression for PAX5 ChIP-seq in the GM12878 cell line. Some groups are uniformly set to 0 (Group 5), while others are significantly predictive of either peaks or flanks (Group 3 and Group 2, respectively). Motifs identified for groups that are strongly predictive of peaks, and the corresponding TFs, are also shown. (B) PAX5 ChIP-seq auROC on the test set, comparing the discriminative performance of SeqGL with motif finding tools and k-mer methods. The colors correspond to those in Fig 2C. (C) Plots showing auROCs on test sets for 105 ChIP-seq experiments using different tools and settings. Three different settings were used for the motif finding tools HOMER, DREME and HOMER (see S1 Fig). ‘Best motif’ uses the highest-ranking motif from each method, as defined by the p-value; ‘Max motif’ uses the motif with the maximum log odds score in each example; and ‘Motif elastic’ uses elastic net logistic regression across all motifs determined by the respective method. Only the ‘Motif elastic’ methods are shown in the performance plots, since they outperform the ‘Best motif’ and ‘Max motif’ methods. SeqGL and other k-mer methods significantly outperform the different motif finding tools across all settings (Wilcoxon rank sum p-values < 7e-3). gkm-SVM performs marginally (but not significantly) better than SeqGL with the 5K top discriminative features (Wilcoxon rank sum p-value = 0.06); SeqGL using a larger feature set (30K) gives performance statistically indistinguishable from gkm-SVM (no significant difference in the distribution of auROC scores based on a Wilcoxon rank sum test, using p-value < 0.05 as our threshold of significance). Furthermore, the elastic-net regressor on the full SeqGL feature space using 10-mers with 3 wildcards (similar to the settings used by gkm-SVM) also yields indistinguishable performance. While the discriminative accuracy is comparable, unlike other k-mer methods, SeqGL identifies multiple distinct DNA binding signals from the same ChIP-seq experiment (S3 Table).
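
As a reading aid for the auROC values in panels B and C, the following is a minimal sketch of how an auROC can be computed for a peak-versus-flank classifier, using the standard identity between the auROC and the normalized rank-sum (Mann-Whitney U) statistic. It is not the evaluation code used for the figure; the function name `auroc`, the label convention (1 for peaks, 0 for flanks) and the example scores are assumptions made for illustration.

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    scores: classifier outputs, higher meaning "more peak-like" (assumed).
    labels: 1 for peak examples, 0 for flank examples (assumed convention).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    # Sort example indices by score, then assign 1-based ranks,
    # giving tied scores the average of the ranks they span.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one peak and one flank example")

    # U statistic for the positive class; dividing by the number of
    # (peak, flank) pairs gives the fraction of correctly ordered pairs.
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical scores: two peaks (label 1) and two flanks (label 0).
example = auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(example)  # 0.75: three of the four peak/flank pairs are ordered correctly
```

An auROC of 0.5 corresponds to random ordering of peaks and flanks, and 1.0 to perfect separation. The per-experiment auROCs in panel C are the values whose distributions are then compared between methods with the Wilcoxon rank sum test.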

doi: https://doi.org/10.1371/journal.pcbi.1004271.g002