Imputation for transcription factor binding predictions based on deep learning

doi:10.1371/journal.pcbi.1005403

Fig 1.

The TFImpute model.

Each input is a TF-cell-sequence triple. In the convolution layer, each filter (motif) corresponds to a column. Each filter scans the input sequence and produces one value at each position. For each filter, the max-pooling layer partitions the signal into three windows and takes the maximum value in each window, yielding three values. The same gate signal is applied to all three values, and the gate signal differs between filters. For each input, the reverse complement of the input sequence is constructed and, together with the same TF and cell line, used as a second input to the same network. Each input therefore yields two values, P1 and P2, for the forward and reverse strands of the sequence. The maximum of P1 and P2 is taken as the final prediction. During training, the prediction is compared with the target, and the error is back-propagated to learn the parameters of the whole network.
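
The forward pass described in this caption can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the sigmoid output, the linear output weights, and the random parameters are all assumptions made for illustration. The caption does not say how the gate signal is derived from the TF and cell line, so the sketch simply takes one gate value per filter as an input.

```python
import math
import random

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scan(seq, filt):
    """Slide one filter (width x 4 weights) along the sequence: one score per position."""
    width = len(filt)
    return [
        sum(filt[j][BASES.index(seq[i + j])] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


def pool_three_windows(signal):
    """Split the signal into three contiguous windows and keep each maximum.

    Requires at least three values so that no window is empty.
    """
    n = len(signal)
    bounds = [0, n // 3, 2 * n // 3, n]
    return [max(signal[bounds[k]:bounds[k + 1]]) for k in range(3)]


def score_strand(seq, filters, gates, out_weights, bias):
    """Score one strand: convolution, 3-window max-pooling, gating, output unit."""
    total = bias
    for filt, gate, weights in zip(filters, gates, out_weights):
        pooled = pool_three_windows(scan(seq, filt))
        # The same gate value multiplies all three pooled values of a filter.
        total += sum(gate * p * w for p, w in zip(pooled, weights))
    return sigmoid(total)  # assumption: a sigmoid output in (0, 1)


def predict(seq, filters, gates, out_weights, bias):
    """Final prediction: maximum over the forward and reverse-complement strands."""
    p1 = score_strand(seq, filters, gates, out_weights, bias)
    p2 = score_strand(reverse_complement(seq), filters, gates, out_weights, bias)
    return max(p1, p2)


# Toy usage with random, untrained parameters (hypothetical values).
rng = random.Random(0)
n_filters, width = 4, 5
filters = [[[rng.uniform(-1, 1) for _ in BASES] for _ in range(width)]
           for _ in range(n_filters)]
gates = [rng.random() for _ in range(n_filters)]  # one gate per filter
out_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_filters)]
seq = "ACGTTGCAGGCTAACGTTAG"
p = predict(seq, filters, gates, out_weights, 0.0)
```

Because both strands pass through the same network and the maximum is taken, the sketch returns the same prediction for a sequence and for its reverse complement.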

Fig 2.

AUC comparison of TFImpute with DeepBind and gkm-SVM using shuffled sequences as negative instances.

(A) Comparison with DeepBind. Each point in the figure corresponds to a TF-cell line combination. (B) AUC for TF-cell line combinations in which DeepBind gives the lowest AUC. (C) Comparison with gkm-SVM using randomly shuffled sequences as negative instances. Each point in the figure corresponds to a TF-cell line combination.

Table 1.

Subsets of TF-cell line combinations using DNase I hypersensitive sites as background.

Table 2.

Random partition of the instances in Base into 3 disjoint subsets.

Table 3.

Subsets of TF-cell line combinations using GC matched negative instances.

Table 4.

Random partition of the instances in Base of Table 3 into 3 disjoint subsets.

Fig 3.

Comparison with gkm-SVM, PIQ, and DeepSEA.

(A) AUC comparison of TFImpute and gkm-SVM on TestSet1, TestSet2, and TestSet3. ‘Shuf cell line’ indicates that the cell lines of the corresponding test set were shuffled and that the trained TFImpute model was then applied to the shuffled dataset. Similarly, ‘Shuf TF’ indicates that the TFs were shuffled. For some of the given regions, PIQ gives NA predictions. NA means that no motif was found at a log probability threshold of 5, or that the region lacks DNase I signal. PIQNoNA in this figure denotes the result after removing all NAs, and PIQ denotes the result after treating NAs as no binding. To calculate the AUC, the predictions were grouped by TF. The middle bar in each box indicates the median. (B) AUC comparison based on predictions grouped by TF-cell line combination. (C) The recall rates of different methods at FDR 0.05 (see Material and methods for more details). The predictions were grouped by TF. (D) AUC comparison of TFImpute on TFs appearing in both TestSet2 and TestSet3. (E) Hierarchical clustering of a subset of the TFs based on the embedding learned by TFImpute. The full clustering is shown in S3 Fig. (F) Hierarchical clustering of a subset of cell lines based on the embedding learned by TFImpute. The full clustering is shown in S4 Fig. (G) The recall rate of TFImpute and DeepSEA at different FDR cutoffs on the datasets provided by DeepSEA.
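
The grouped AUC and the recall at a fixed FDR used in these panels could be computed along the following lines. This is a sketch under assumptions, not the paper's evaluation code: the function names are hypothetical, the AUC uses the standard rank-based (Mann-Whitney) formula, and the recall-at-FDR definition shown here (the largest recall among score cutoffs whose false discovery rate does not exceed the threshold) is one common choice that may differ from the definition given in Material and methods.

```python
from collections import defaultdict


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U) with average ranks for tied scores.

    labels are 0/1 integers; both classes must be present.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def auc_by_group(records):
    """records: iterable of (group, label, score) tuples; returns {group: AUC}.

    The group can be a TF name or a (TF, cell line) pair.
    """
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: auc(ys, ss) for g, (ys, ss) in grouped.items()}


def recall_at_fdr(labels, scores, fdr=0.05):
    """Largest recall among score cutoffs whose false discovery rate is <= fdr."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    tp = fp = 0
    best = 0.0
    for idx, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Evaluate a cutoff only at the end of a run of tied scores.
        if idx + 1 < len(ranked) and ranked[idx + 1][0] == score:
            continue
        if fp / (tp + fp) <= fdr:
            best = tp / n_pos
    return best


# Toy usage with made-up predictions (labels: 1 = bound, 0 = not bound).
records = [("CTCF", 1, 0.9), ("CTCF", 0, 0.2), ("MAX", 1, 0.4), ("MAX", 0, 0.6)]
per_tf = auc_by_group(records)
```

Grouping by TF versus by TF-cell line combination only changes the key passed as the first element of each record.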

Table 5.

Running time of different methods.

Fig 4.

The distributions of the calculated enhancer signature for the top and bottom 100 enhancers.

The p value was calculated using a t-test. Note that no enhancer reporter assay data are available for GM12878, which is a good control.

Fig 5.

Predicted binding affinity change between two alleles of SNP rs12740374 (T/G).

The color in each cell represents the predicted binding affinity of allele G minus that of allele T for the corresponding TF and cell line. The number in each cell of the heatmap is the number of ChIP-seq datasets in the training set for the corresponding TF and cell line. If TFImpute predicts strong binding for the minor allele but no binding for the major allele, the score is 1. If TFImpute predicts no binding difference between the two alleles, the score is 0.
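
The per-cell score in this heatmap is a difference of two predictions, which can be sketched as follows. Everything here is hypothetical: `toy_predict` is a stand-in for a trained model, the TF and cell line names are placeholders, and the flanking sequences are invented and are not the genomic context of rs12740374.

```python
def allele_difference(predict, left, right, allele_g, allele_t, tfs, cell_lines):
    """Return {(tf, cell): prediction for allele G minus prediction for allele T}."""
    seq_g = left + allele_g + right
    seq_t = left + allele_t + right
    return {
        (tf, cell): predict(tf, cell, seq_g) - predict(tf, cell, seq_t)
        for tf in tfs
        for cell in cell_lines
    }


def toy_predict(tf, cell, seq):
    """Hypothetical stand-in for a trained model: 1.0 if a made-up motif occurs."""
    motifs = {"TF_A": "GGTGC", "TF_B": "AAAAA"}
    return 1.0 if motifs[tf] in seq else 0.0


# Invented flanks; only the central base differs between the two sequences.
scores = allele_difference(
    toy_predict, "ACCTG", "TGCAA", "G", "T", ["TF_A", "TF_B"], ["cell_1"]
)
```

In this toy case the G allele creates the made-up motif for `TF_A`, giving a score of 1, while `TF_B` is unaffected and scores 0.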
