FilterDCA: Interpretable supervised contact prediction using inter-domain coevolution

doi:10.1371/journal.pcbi.1007621

Fig 1.

Typical patterns in inter-domain contact matrices.

(A) Contact frequencies in a 15 × 15 contact-map window centered on an HH or EE contact. The average is taken over the 46978 HH and 12281 EE contacts of the training set. The mean contact matrices are a superposition of parallel, anti-parallel and mixed HH and EE contacts. (B) We disentangle these patterns by k-means clustering with k = 3. The 6 resulting centroids, 3 for a central HH inter-domain contact (upper row) and 3 for a central EE contact (lower row), show pronounced patterns and can be used to filter DCA predictions.
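The window extraction and clustering described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `extract_window` and `kmeans`, the zero-padding at the matrix borders, and the plain Lloyd iteration with random initialisation are all assumptions made for the example.

```python
import numpy as np

def extract_window(contact_map, i, j, half=7):
    """Return the (2*half+1) x (2*half+1) window centred on (i, j).

    Zero-padding at the borders is an assumption of this sketch.
    """
    size = 2 * half + 1
    padded = np.pad(contact_map, half, mode="constant")
    # Index (i, j) of the original map sits at (i + half, j + half) after padding.
    return padded[i:i + size, j:j + size]

def kmeans(X, k=3, n_iter=50, seed=0):
    """Plain Lloyd k-means on the rows of X; returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance of every window to every centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, labels

# Toy usage with random binary "contact maps" (hypothetical data).
rng = np.random.default_rng(1)
toy_map = (rng.random((40, 30)) < 0.1).astype(float)
contacts = np.argwhere(toy_map == 1)
windows = np.array([extract_window(toy_map, i, j).ravel() for i, j in contacts])
centroids, labels = kmeans(windows, k=3)
filters = centroids.reshape(3, 15, 15)  # one 15 x 15 filter per cluster
```

In the setting of the caption, the clustering would be run once on the windows around HH contacts and once on those around EE contacts, giving six filters in total.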

Fig 2.

Typical patterns in DCA-score matrices.

(A,B) The average DCA scores in a 15 × 15 window around an HH or EE contact are displayed, using the same selection of contacts as in Fig 1 for panels A, and the same sub-clustering for panels B.

Fig 3.

General scheme of FilterDCA: Our approach combines the results of plmDCA applied to two-domain MSAs with structural filters constructed as average contact matrices using 6 contact classes.

Structural supervision is used to learn a logistic regression based on the plmDCA score itself, and the best correlation with one of the six structural filters.
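A rough sketch of the two-feature model described here: the filter score is taken as the best correlation between a DCA-score window and any of the structural filters, and both features enter a logistic function. The use of Pearson correlation, the function names, and the weights `w` and `b` are assumptions for illustration; in the paper the regression parameters are learned from the training set.

```python
import numpy as np

def best_filter_correlation(score_window, filters):
    """Highest Pearson correlation between a DCA-score window and any filter."""
    x = score_window.ravel() - score_window.mean()
    best = -1.0
    for f in filters:
        y = f.ravel() - f.mean()
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        # A constant window or filter has undefined correlation; treat it as 0.
        r = float(x @ y / denom) if denom > 0 else 0.0
        best = max(best, r)
    return best

def contact_probability(dca_score, filter_score, w, b):
    """Logistic model P(contact | x) with x = (dca_score, filter_score)."""
    z = w[0] * dca_score + w[1] * filter_score + b
    return 1.0 / (1.0 + np.exp(-z))

# Toy usage with hypothetical filters and hypothetical weights.
rng = np.random.default_rng(0)
toy_filters = rng.random((6, 15, 15))
window = toy_filters[2] + 0.01 * rng.random((15, 15))
f_score = best_filter_correlation(window, toy_filters)
p = contact_probability(0.4, f_score, w=(5.0, 3.0), b=-4.0)
```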

Fig 4.

Decision boundary for logistic regression.

The lines show, for large (panel A) and medium (panel B) MSA sizes, the decision boundary defined by P(⊕|x) = 1/2, for different filter sizes ranging from 5 × 5 to 69 × 69.
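As a reminder of why this boundary is a straight line, assume the logistic regression is linear in the two features, with x_1 the plmDCA score and x_2 the filter score. The symbols w_1, w_2 and b are notation introduced here for the fitted parameters, not taken from the paper:

```latex
P(\oplus \mid x) = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + b)}},
\qquad
P(\oplus \mid x) = \tfrac{1}{2}
\;\Longleftrightarrow\;
w_1 x_1 + w_2 x_2 + b = 0 .
```

Under this assumption, each filter size yields its own fitted (w_1, w_2, b) and hence its own line in the plane of the two scores.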

Fig 5.

Positive predictive values of FilterDCA for inter-domain contact prediction.

PPVs are shown as a function of the number of predictions for large (panel A) and medium (panel B) MSA sizes, averaged over the different domain-domain interactions in the test set. Different filter sizes are compared to standard plmDCA.
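For reference, a PPV curve of this kind can be computed by ranking residue pairs by score and dividing the running count of true contacts by the number of predictions made so far. The sketch below handles a single interaction; the function name `ppv_curve` is hypothetical, and averaging over interactions as in the figure would be a mean over such curves.

```python
import numpy as np

def ppv_curve(scores, is_contact):
    """PPV after the top-1, top-2, ..., top-N predictions, ranked by score."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(is_contact, dtype=float)[order]
    # Cumulative true positives divided by the number of predictions.
    return np.cumsum(hits) / np.arange(1, len(hits) + 1)

# Toy usage: three residue pairs, only the top-scoring one is a true contact.
curve = ppv_curve([0.9, 0.1, 0.5], [1, 0, 0])
```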

Fig 6.

Contact map prediction for the interaction between the domain families PF02773 and PF00438 co-occurring in S-adenosylmethionine synthetases.

The upper panels (A) show the predictions of plmDCA, the lower panels (B) those of FilterDCA, for the first 10, 100 or 1000 predicted contacts. In each figure, the native contacts are shown in grey, TP predictions of the two methods in blue, FP predictions in red. This example shows a case where plmDCA detects some signal, which is subsequently cleaned by FilterDCA.

Fig 7.

Contact map prediction for the interaction between the domain families PF01557 and PF09298 co-occurring in Fumarylacetoacetate (FAA) hydrolases.

The upper panels (A) show the predictions of plmDCA, the lower panels (B) those of FilterDCA, for the first 10, 100 or 1000 predicted contacts. In each figure, the native contacts are shown in grey, TP predictions of the two methods in blue, FP predictions in red. This example shows a case where the plmDCA signal is weak, and consequently cleaning by FilterDCA leads to only a marginal improvement.

Fig 8.

Fraction of true contacts as a function of the FilterDCA score.

The fraction of true contacts is evaluated in the test sets for large (panel A) and medium (panel B) MSA sizes, as a function of the predicted contact probability P(⊕|x) provided by FilterDCA (filter size 45 × 45). We observe a clear linear relationship, with a slope slightly below one due to overfitting effects resulting from the finite training set and the limited complexity of our regression model.
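A calibration check of this kind can be sketched by binning the predicted probabilities and computing the fraction of true contacts per bin. The choice of ten equal-width bins and the function name `reliability` are assumptions for illustration; the caption does not state the binning used.

```python
import numpy as np

def reliability(probs, is_contact, n_bins=10):
    """Fraction of true contacts in equal-width bins of predicted probability.

    Returns (bin_centers, fraction_true); empty bins are reported as NaN.
    """
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(is_contact, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # np.digitize is 1-based; clip so that a probability of exactly 1.0
    # falls into the last bin.
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    frac = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            frac[b] = y[mask].mean()
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, frac

# Toy usage with hypothetical predictions.
centers, frac = reliability([0.05, 0.05, 0.95, 1.0], [0, 1, 1, 1])
```

A perfectly calibrated predictor would give `frac` equal to `centers` in every populated bin; a slope below one, as reported in the caption, means the highest predicted probabilities slightly overstate the observed contact fraction.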

Fig 9.

Positive predictive values of FilterDCA for inter-protein contact prediction.

PPVs are shown as a function of the number of predictions, averaged over the different protein-protein interactions in the test set. Different filter sizes are compared to standard plmDCA.

Fig 10.

Positive predictive values for inter-domain contact prediction.

PPVs for plmDCA, FilterDCA and PconsC4 are shown as a function of the number of predictions, averaged over the different domain-domain interactions in the test set. Predictions within 5 residues from the concatenation of the two domains are removed from the prediction, since they provide incoherent inputs to CNN and tend to produce many FP; this artifact strongly reduces the accuracy of CNN-based contact predictors.

Fig 11.

Positive predictive values for inter-protein contact prediction.

PPVs for plmDCA, FilterDCA, PconsC4, RaptorX-ComplexContact and DeepMetaPSICOV are shown as a function of the number of predictions, averaged over the 18 PPIs in the test set for which RaptorX successfully produced output. Predictions within 5 residues from the concatenation of the two domains are removed from the prediction, since they provide incoherent inputs to CNN and tend to produce many FP; this artifact strongly reduces the initial accuracy of CNN-based contact predictors.

Table 1.

We split the 2598 MSAs of interacting domains into 3 datasets according to the effective number of independent sequences, M_eff.

They contain approximately the same number of MSAs (first row) and inter-domain contacts (second row). In brackets, the percentage of inter-domain contacts is given with respect to the total number of inter-domain residue pairs.

Fig 12.

The distributions of DCA scores of contacts and non-contacts.

Distributions are shown for MSAs with M_eff > 200 (panel A) and 50 < M_eff ≤ 200 (panel B). Note that the enrichment of true positive predictions (contacts) is very high in the tail of large DCA scores. In fact, the majority of the pairs with a DCA score larger than 0.3 correspond to contacts. This is not the case for MSAs with M_eff < 50 (panel C), where the two distributions completely overlap.

Fig 13.

Filters used in the calculation of the filter score.

The filters were determined from the inter-domain contact maps in the small MSA, with k-means clustering for k = 3 applied separately for HH and EE contacts.
