Fig 1.
GA-IPA approach for paralog pairing between two interacting protein families.
We show schematics to illustrate the key points of our approach (these schematics are conceptual and do not represent actual phylogenetic trees or similarity networks). (A) One starts from the separate MSAs of two interacting protein families A and B. Each species present in these MSAs may contain multiple paralogs in each family. Our goal is to infer which paralog in family A interacts with which paralog in family B. As indicated, two types of information are used: phylogeny and residue coevolution. (B) We first construct a sequence-similarity network, specifically a k-nearest-neighbor (kNN) network or an orthology network, for each of the two families. These two networks are aligned to find a pairing of the sequences that maximises the similarity of the two networks, while only allowing pairs within the same species. Repeated runs of a stochastic graph-alignment (GA) algorithm based on simulated annealing identify robust pairs, i.e. pairs that are consistently made across GA runs. (C) This partial but robustly paired MSA is used as input to the iterative pairing algorithm (IPA), which is based on residue coevolution as detected by DCA. IPA iteratively extends the paired MSA until all sequences are paired. (D) The output full co-MSA is our prediction for the interacting protein pairs between families A and B.
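As a rough illustration of panel (B), the following minimal Python sketch builds a kNN network from aligned sequences and scores a candidate pairing by the number of network edges it preserves. It is not the implementation used here: the function names, the use of Hamming distance as the similarity measure, and the edge-overlap score are all assumptions made for illustration. The within-species constraint and the simulated-annealing search are left out.

```python
import numpy as np

def hamming_distances(seqs):
    """Pairwise fraction of mismatched columns between equal-length aligned sequences.
    (Hamming distance is an assumed similarity measure, chosen for simplicity.)"""
    arr = np.array([list(s) for s in seqs])
    return (arr[:, None, :] != arr[None, :, :]).mean(axis=2)

def knn_adjacency(dist, k):
    """Symmetric 0/1 adjacency matrix: i and j are linked if either one
    is among the k nearest neighbors of the other."""
    n = dist.shape[0]
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)  # a sequence is not its own neighbor
    nearest = np.argsort(d, axis=1, kind="stable")[:, :k]
    adj = np.zeros((n, n), dtype=int)
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = 1
    return np.maximum(adj, adj.T)

def shared_edges(adj_a, adj_b, pairing):
    """Number of edges of network A that land on edges of network B,
    where pairing[i] is the index in B paired with sequence i in A.
    (Hypothetical objective; the actual GA score may differ.)"""
    mapped_b = adj_b[np.ix_(pairing, pairing)]
    return int((adj_a * mapped_b).sum() // 2)  # each edge is counted twice

# Toy families with two clusters each, listed in a different order.
seqs_a = ["AAAA", "AAAT", "TTTT", "TTTA"]
seqs_b = ["CCCC", "GGGG", "CCCG", "GGGC"]
adj_a = knn_adjacency(hamming_distances(seqs_a), k=1)
adj_b = knn_adjacency(hamming_distances(seqs_b), k=1)
good = shared_edges(adj_a, adj_b, [0, 2, 1, 3])
bad = shared_edges(adj_a, adj_b, [0, 1, 2, 3])
```

In this toy example the pairing `[0, 2, 1, 3]` maps both edges of network A onto edges of network B, whereas the identity pairing preserves none. A stochastic search would propose swaps within a species and keep those that tend to increase such a score.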
Fig 2.
Performance of graph alignment.
The mean fraction of true-positive pairings (TP fraction) is shown as a function of the number k of nearest neighbors in the kNN graph for 100 GA realizations (blue). Performance for the orthology graph is also shown (red); it does not depend on k. Error bars (shaded regions) correspond to one standard deviation. The mean TP fraction of the IPA starting without any training set of known paired sequences is shown for comparison (yellow). It was obtained using Nincrement = 6 and by averaging over 50 replicates that differ in their initial random pairings, cf. Materials and methods. The dotted black line shows the average TP fraction obtained for random HK-RR pairings within species (null model).
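To give a feel for the null model, the short Python sketch below computes the expected TP fraction of random within-species pairings. It assumes that each species has equal numbers of HKs and RRs, that each HK has exactly one true partner, and that pairings are drawn uniformly at random. The function name and the example paralog counts are hypothetical.

```python
from fractions import Fraction

def expected_random_tp_fraction(paralog_counts):
    """Expected TP fraction for uniform random one-to-one pairing within species.

    paralog_counts: number of HK-RR pairs in each species (illustrative input).
    """
    # A uniformly random permutation has one fixed point on average,
    # so each species contributes one correct pair in expectation.
    n_species = len(paralog_counts)
    return Fraction(n_species, sum(paralog_counts))

# Three hypothetical species with 2, 4 and 6 paralogs.
null_tp = expected_random_tp_fraction([2, 4, 6])
```

Under these assumptions the null TP fraction equals the inverse of the mean number of paralogs per species, so it decreases as species contain more paralogs.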
Fig 3.
(A) Robustness histograms for the 21-NN graph. We perform 100 GA runs and count how many times an HK is paired to the same RR. The horizontal axis gives the number of times a given pair appears, and the vertical axis is the number of pairs appearing that many times across replicates. Black bars are the total number of pairs, and red bars are the number of TPs. The TP ratios are indicated on top of the bars. (B) Number of robust pairs (occurring in all 100 GA runs) obtained by GA for each similarity network. The fraction of correctly matched pairs in this robust subset is indicated in each case.
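The counting behind both panels can be sketched in a few lines of Python. This is an illustrative sketch, not the code used to produce the figure: the function names and the representation of each GA run as a dictionary mapping HK identifiers to RR identifiers are assumptions.

```python
from collections import Counter

def pair_counts(runs):
    """Count in how many runs each (hk, rr) pair appears.
    runs: list of dicts mapping HK id -> RR id (assumed data layout)."""
    counts = Counter()
    for pairing in runs:
        counts.update(pairing.items())
    return counts

def robustness_histogram(counts, true_pairs):
    """Map occurrence count -> (number of pairs, number of true-positive pairs)."""
    hist = {}
    for pair, c in counts.items():
        total, tp = hist.get(c, (0, 0))
        hist[c] = (total + 1, tp + (pair in true_pairs))
    return hist

def robust_pairs(counts, n_runs):
    """Pairs that occur in every run."""
    return {pair for pair, c in counts.items() if c == n_runs}

# Three toy runs over two HKs.
runs = [
    {"hk1": "rr1", "hk2": "rr2"},
    {"hk1": "rr1", "hk2": "rr3"},
    {"hk1": "rr1", "hk2": "rr2"},
]
true_pairs = {("hk1", "rr1"), ("hk2", "rr2")}
counts = pair_counts(runs)
hist = robustness_histogram(counts, true_pairs)
robust = robust_pairs(counts, n_runs=len(runs))
```

Here `hist` holds the data for the black and red bars of panel (A), and `robust` is the kind of subset whose size is reported in panel (B).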
Fig 4.
GA-IPA outperforms both GA and IPA.
The mean fraction of true-positive pairings (TP fraction) is shown as a function of the number k of nearest neighbors in the kNN graph. We combine GA and IPA in our GA-IPA method: we use the robust pairs obtained by GA as a seed co-MSA for IPA. The results of GA and of IPA without seed co-MSA from Fig 2 are shown for comparison. For IPA, we use Nincrement = 6, both without (IPA) and with (GA-IPA) seed co-MSA.
Fig 5.
Robust performance of GA-IPA in hard cases of paralog pairing.
(A) Same as Fig 4 but on a dataset having on average 29.2 paralogs per species, compared to 11.03 in Fig 4. While the performance of GA is substantially reduced compared to Fig 4, and that of IPA is reduced even more, GA-IPA achieves much larger TP fractions than either GA or IPA alone. (B) Results of GA, IPA and GA-IPA for smaller datasets obtained by species subsampling from the full HK-RR dataset, with 11.1 paralogs per species on average. We observe that GA-IPA needs almost one order of magnitude fewer sequences than IPA to reach comparable TP fractions.