Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins

Carlos A. Gandarilla-Pérez; Sergio Pinilla; Anne-Florence Bitbol; Martin Weigt

doi:10.1371/journal.pcbi.1011010

Abstract

Predicting protein-protein interactions from sequences is an important goal of computational biology. Various sources of information can be used to this end. Starting from the sequences of two interacting protein families, one can use phylogeny or residue coevolution to infer which paralogs are specific interaction partners within each species. We show that these two signals can be combined to improve the performance of the inference of interaction partners among paralogs. For this, we first align the sequence-similarity graphs of the two families through simulated annealing, yielding a robust partial pairing. We next use this partial pairing to seed a coevolution-based iterative pairing algorithm. This combined method improves performance over either separate method. The improvement obtained is striking in the difficult cases where the average number of paralogs per species is large or where the total number of sequences is modest.

Author summary

When two protein families interact, their sequences feature statistical dependencies. First, interacting proteins tend to share a common evolutionary history. Second, maintaining structure and interactions through the course of evolution yields coevolution, detectable via correlations in the amino-acid usage at contacting sites. Both signals can be used to computationally predict which proteins are specific interaction partners among the paralogs of two interacting protein families, starting just from their sequences. We show that combining them improves the performance of interaction partner inference, especially when the average number of potential partners is large and when the total data set size is modest. The resulting paired multiple-sequence alignments might be used as input to machine-learning algorithms to improve protein-complex structure prediction, as well as to understand interaction specificity in signaling pathways.

Citation: Gandarilla-Pérez CA, Pinilla S, Bitbol A-F, Weigt M (2023) Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins. PLoS Comput Biol 19(3): e1011010. https://doi.org/10.1371/journal.pcbi.1011010

Editor: Andrea Ciliberto, IFOM - the Firc Institute of Molecular Oncology, ITALY

Received: August 24, 2022; Accepted: March 8, 2023; Published: March 30, 2023

Copyright: © 2023 Gandarilla-Pérez et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Matlab implementations of the DCA-based IPA and the MI-based IPA on our standard HK-RR dataset are freely available at https://doi.org/10.5281/zenodo.1421861 and https://doi.org/10.5281/zenodo.1421781, respectively. Julia implementations of the GA-IPA (as well as of the IPA), both DCA-based and MI-based, are freely available at https://doi.org/10.5281/zenodo.7731108.

Funding: CAGP and MW acknowledge funding by the EU H2020 Research and Innovation Programme MSCA-RISE-2016 (grant agreement No. 734439 InferNet). AFB acknowledges funding by the European Research Council (ERC) under the EU H2020 Research and Innovation Programme (grant agreement No. 851173). AFB and MW thank the Institut de Biologie Paris-Seine (IBPS) at Sorbonne Université for funding via a Collaborative Grant (Action Incitative). This work was performed in part at the Aspen Center for Physics, which is supported by National Science Foundation grant PHY-1607611. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Sequence-driven modeling and prediction techniques for proteins have recently seen great advances, thanks to the combination of the rapidly growing amount of available protein-sequence data with powerful statistical and machine learning techniques. Recently, AlphaFold brought a major advance in protein-structure prediction for monomeric proteins, provided that a sufficiently large number of homologous proteins can be found [1]. Indeed, AlphaFold starts from multiple-sequence alignments (MSAs) of homologs. Extensions to multimers and protein-protein interactions have been proposed, and they also start from MSAs of homologs of the proteins involved [2–4]. While these advances are impressive, many protein complexes remain unsolved by current computational means. A possible direction for improving these predictions is to produce better co-alignments of interacting proteins (co-MSAs), where each row contains the concatenation of two interacting proteins. Indeed, co-MSAs make it possible to exploit correlations between interaction partners, which convey important information about binding specificity [5]. Beyond the perspective of improving quaternary protein-structure prediction, being able to accurately pair interacting paralogs is important for unveiling signaling networks and for understanding interaction specificity. Indeed, homologous signaling pathways in a given organism employ homologous mechanisms, but crosstalk between pathways may be unwanted. This is the case for instance in two-component systems in bacteria [6, 7], or calcium signaling in plants [8, 9].

Pairing interacting paralogs and obtaining co-MSAs is difficult because many protein families contain several paralogous proteins encoded within the same genome. Out of the almost 20,000 protein domain families in the Pfam database [10], now integrated into InterPro [11], more than 1300 have, on average, at least 10 paralogs per species, and about 400 more than 30 paralogs. These strongly amplified protein families include, beyond repeat domains which are not the most relevant here, many specifically interacting protein families like receptors, transporters, kinases etc., used, e.g., in signal transduction.

Therefore, even if we know that two protein families A and B interact, it can still be difficult to determine which particular paralog in family A interacts with which one in family B. In some cases, the problem can be solved using genomic co-localization of the protein-coding genes, as in operons in bacteria, and we will use such cases as benchmark cases for our algorithm. If this does not apply to the proteins studied, e.g. because they are not in the same operon, or because they are eukaryotic, the paralog-pairing problem becomes substantially more challenging. In practice, large-scale coevolution-based studies of inter-protein structural contacts [12] and protein-protein interactions [13, 14], as well as recent deep learning-based predictions of quaternary structures [2, 4], rely on co-MSAs constructed using genomic co-localization [12, 15] when possible, and orthology, determined by reciprocal closest matching sequences [2–4, 13, 14]. However, restricting to orthologs reduces co-MSA depth compared to using all paired paralogs.

Here, we propose to combine two important evolutionary signals, namely phylogeny and residue co-evolution, in order to improve paralog pairing. First, phylogeny, or in practice sequence similarity, can be helpful because when two proteins A and B interact in one species, and possess close homologs A’, B’ in a second species, then A’ and B’ are likely to also interact. More generally, the phylogenies of interacting protein families are expected to be similar. This idea has been used to discriminate interacting from non-interacting families starting from protein orthologs in the MirrorTree family of approaches [16–18], some of which address the paralog pairing problem [19–28]. The use of orthology for co-MSA construction [2, 4, 13, 14] also relies on this idea. One of us proposed to employ the similarity of phylogenies through neighbor-graph alignment for paralog pairing, but performance remained limited [26]. Second, interacting proteins coevolve (see [5, 29, 30] for reviews), which can be employed to predict interaction partners, as first demonstrated in a Bayesian network approach in [31]. Coevolutionary modeling approaches like Direct Coupling Analysis (DCA) [15, 32, 33] or Gremlin [12, 13] have been used successfully to infer inter-protein structural contacts from co-MSAs. More recently, some of us showed that iterative algorithms based on DCA allow paralog pairing [34, 35]. However, while these methods can function even without an initial seed co-MSA, their performance remains limited for high paralog numbers as well as for small datasets [34]. Interestingly, these coevolution-based methods already benefit from phylogenetic correlations [36, 37], and mutual information, which uses all forms of statistical dependences, performs slightly better than DCA for the pairing task [38]. These results show the potential of combining phylogenetic and coevolutionary signals. Coevolution-based methods are unlikely to exploit sequence similarity in an optimal way, because they are not designed for that.

Here, we show that explicitly combining these two signals substantially improves the pairing of paralogs between interacting protein families. We first use phylogenetic signal, by aligning orthology or sequence-similarity graphs, to produce an accurate co-MSA spanning a subset of the data. For this, we employ a stochastic algorithm based on simulated annealing to solve this graph-alignment (GA) problem. We compare the results of GA to those of a MirrorTree-based approach. We next use the partial co-MSA obtained from similarity graph alignment as a seed for the DCA-based iterative pairing algorithm (IPA). Our method, called GA-IPA as it combines GA and IPA, is illustrated in Fig 1. We obtain high-quality co-MSAs, outperforming those obtained by methods based on sequence similarity (phylogeny) or on coevolution alone. We use bacterial two-component systems as a specific example, and we also apply our approach to other protein families, thus showing its robustness and broader applicability.

Fig 1. GA-IPA approach for paralog pairing between two interacting protein families.

We show schematics to illustrate the key points of our approach (these schematics are conceptual and do not represent actual phylogenetic trees or similarity networks). (A) One starts from the separate MSAs of two interacting protein families A and B. Each species present in these MSAs may contain multiple paralogs in each family. Our goal is to infer which paralog in family A interacts with which paralog in family B. As indicated, two types of information will be used: phylogeny and residue coevolution. (B) We first construct a sequence-similarity network, specifically a k nearest-neighbor (kNN) or an orthology network, for each of the two families. These two networks are aligned to find a pairing of the sequences that maximises the similarity of the two networks, while only allowing pairs within the same species. Repeated runs of a stochastic graph-alignment (GA) algorithm based on simulated annealing make it possible to identify robust pairs, which are consistently paired across GA runs. (C) This partial but robustly paired MSA is used as an input to the iterative pairing algorithm (IPA) based on residue coevolution as detected by DCA. IPA iteratively extends the paired MSA until all sequences are paired. (D) The output full co-MSA is our prediction for the interacting protein pairs between families A and B.

https://doi.org/10.1371/journal.pcbi.1011010.g001

Results and discussion

Goal

We start from two MSAs comprising the sequences of two interacting protein families A and B (see Fig 1A). We assume that each species comprises the same number of paralogs of family A and of family B. While this is a simplification with respect to typical protein families, we make this choice for two reasons: (i) our benchmark data sets, generated using genome proximity, possess this property, and (ii) generalizing the method to the unbalanced case is rather straightforward, but the formulation is more involved. We therefore focus on the simpler case that is directly testable with our benchmark data.

Using these MSAs, and only these MSAs, we aim to construct a bijective paralog pairing (or matching) which assigns to each sequence in family A one putative interaction partner in family B from the same species. Potential inter-species protein-protein interactions are thus discarded by our algorithm, but could be detected afterwards with techniques similar to IPA.

In order to construct this pairing, we successively use two distinct methods. The first one exploits phylogenetic relationships between interaction partners, via sequence similarity (i.e., it aims at identifying interologs). The second, and more involved one, relies on the inter-family coevolutionary signal as detectable by DCA (or mutual information). Both methods lead to computationally hard optimization problems (by contrast, e.g., to standard bipartite matching), which we approximately solve by heuristic techniques.

Aligning sequence-similarity networks to identify a subset of robust paralog pairings

In the first step, we aim at using phylogenetic relationships to identify potential interaction partners. Specifically, we exploit phylogenetic relations by constructing sequence-similarity networks for each of the two families A and B, and by aligning them together (see Materials and methods for details). We use the Hamming distance, which simply counts the amino-acid mismatches between two aligned sequences. It could be replaced by more sophisticated distance measures, including, e.g., amino-acid similarities or a position-specific weighting based on conservation in MSA columns. However, since our sequences are already well aligned (using profile Hidden Markov Models [39]), we expect that this would not substantially change our results. Sequence-similarity networks are weighted undirected graphs, where each node corresponds to a sequence. We consider two types of similarity networks. First, in the kNN network, each node is connected to its k nearest neighbors, where k is a parameter that will be varied in our study. Second, in the orthology network, each protein is connected to its orthologs, identified using best reciprocal hits, in other species. These networks are weighted in a way that gives stronger importance (i.e. larger weight) to edges between similar sequences than to edges between distant sequences.
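The network construction described above can be illustrated with a minimal sketch. It assumes a weight of the form exp(-d / scale), with the scale defaulting to the alignment length; both the functional form and the `scale` parameter are illustrative assumptions rather than the weighting specified in Materials and methods, and the function names are hypothetical. The sketch also does not restrict neighbors by species.

```python
import math


def hamming(a, b):
    """Count the mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_network(seqs, k, scale=None):
    """Build an undirected kNN similarity network.

    Returns a dict {(i, j): weight} with i < j. An edge is present if j is
    among the k nearest neighbors of i, or i among those of j.
    The weight exp(-d / scale) is an assumed form that gives larger weight
    to more similar sequences.
    """
    n = len(seqs)
    if scale is None:
        scale = len(seqs[0])  # assumption: normalize by alignment length
    dist = [[hamming(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]
    edges = {}
    for i in range(n):
        # Sort the other nodes by distance, breaking ties by index.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (dist[i][j], j))
        for j in others[:k]:
            edges[(min(i, j), max(i, j))] = math.exp(-dist[i][j] / scale)
    return edges
```

For instance, `knn_network(["AAAA", "AAAT", "TTTT", "TTTA"], k=1)` links sequence 0 to sequence 1 and sequence 2 to sequence 3, each pair differing at a single position.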

Because of the similarities in the phylogenies of interacting proteins (see above), edges of the two similarity networks associated with families A and B tend to overlap when the nodes representing interaction partners are paired. We therefore map the search for a paralog pairing to a graph-alignment problem, where the vertices of the two graphs are paired to maximize the number of overlapping edges (see Materials and methods). To solve this difficult problem, we introduce a heuristic stochastic local search algorithm based on simulated annealing (cf. Materials and methods).
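A toy version of such a simulated-annealing search is sketched below. It is only an illustration under stated assumptions: the objective (sum of products of weights of overlapping edges), the move set (swapping the partners of two sequences of the same species), the geometric cooling schedule, and the choice to return the best pairing seen are all assumptions, and the actual cost function and schedule are those of Materials and methods. Edges are given as dictionaries mapping index pairs (i, j) with i < j to weights, and the same indices 0..M-1 label the sequences of both families.

```python
import math
import random


def overlap(edges_a, edges_b, pairing):
    """Weighted overlap: A-edges whose image under the pairing is a B-edge."""
    total = 0.0
    for (i, j), w_a in edges_a.items():
        p, q = pairing[i], pairing[j]
        w_b = edges_b.get((min(p, q), max(p, q)))
        if w_b is not None:
            total += w_a * w_b
    return total


def anneal_alignment(edges_a, edges_b, species, n_steps=20000,
                     t_start=1.0, t_end=0.01, seed=0):
    """Search for a within-species pairing that maximizes the overlap.

    species: list of index lists, one list per species.
    Returns (best pairing found as {a_index: b_index}, its overlap).
    """
    rng = random.Random(seed)
    pairing = {}
    for members in species:
        shuffled = members[:]
        rng.shuffle(shuffled)  # random initial pairing within each species
        pairing.update(zip(members, shuffled))
    current = overlap(edges_a, edges_b, pairing)
    best, best_pairing = current, dict(pairing)
    swappable = [m for m in species if len(m) > 1]
    if not swappable:
        return best_pairing, best
    for step in range(n_steps):
        # Geometric cooling from t_start to t_end (assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(1, n_steps - 1))
        i, j = rng.sample(rng.choice(swappable), 2)
        pairing[i], pairing[j] = pairing[j], pairing[i]
        # Full recomputation for clarity; a local update would be faster.
        proposed = overlap(edges_a, edges_b, pairing)
        delta = proposed - current
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current = proposed  # Metropolis acceptance
            if current > best:
                best, best_pairing = current, dict(pairing)
        else:
            pairing[i], pairing[j] = pairing[j], pairing[i]  # undo the swap
    return best_pairing, best
```

Running the function several times with different values of `seed` gives the independent stochastic runs needed to assess which pairs are reproducible.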

Fig 2 shows results for a benchmark set of M = 5064 bacterial two-component systems from N = 459 species, describing the interaction between histidine kinases (HK) and response regulators (RR), cf. Materials and methods for details of the dataset and the corresponding similarity networks. Similar results are obtained for other protein-family pairs, as we report in the Supporting information.

Fig 2. Performance of graph alignment.

The mean fraction of true-positive pairings (TP fraction) is shown as a function of the number k of nearest neighbors in the kNN graph for 100 GA realizations (blue). Performance for the orthology graph is also shown (red); note that it does not depend on k. Error bars (shaded regions) correspond to one standard deviation. The mean TP value of the IPA starting without any training set of known paired sequences is shown for comparison (yellow). It was obtained using N_increment = 6 and by averaging over 50 replicates that differ in their initial random pairings, cf. Materials and methods. The dotted black line shows the average TP fraction obtained for random HK-RR pairings within species (null model).

https://doi.org/10.1371/journal.pcbi.1011010.g002

For the orthology network, we find on average about 2/3 of correct pairings (true positives, TP), and 1/3 of incorrect pairings (false positives, FP), with higher TP fractions corresponding in general to lower GA cost, as is shown in the Supporting information. This is much better than a random within-species pairing, which would have an average TP fraction of only N/M ≃ 9%. For the kNN network, we find a strong dependence on k. Indeed, for small k, only a few edges can be aligned, since they connect different species, and the similarity networks contain little information about the correct paralog pairing. For larger values of k, the results slightly outperform those of the orthology network, with an average TP fraction of about 70% (see Fig 2). However, the increase of accuracy with increasing k comes with the higher computational cost of aligning denser similarity networks.
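The random baseline of N/M follows from a short counting argument, written out here with m_s denoting the number of paralogs in species s (a symbol introduced for this illustration only). In a uniformly random one-to-one pairing within a species, each sequence receives its true partner with probability 1/m_s, so each species contributes one correct pair on average:

```latex
\mathbb{E}[\mathrm{TP}]
  = \sum_{s=1}^{N} \sum_{i=1}^{m_s} \frac{1}{m_s}
  = N,
\qquad
\frac{\mathbb{E}[\mathrm{TP}]}{M}
  = \frac{N}{M}
  = \frac{459}{5064}
  \approx 0.091 .
```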

Fig 2 shows that our GA results reach, on this HK-RR dataset, a lower performance than the IPA starting from random matchings, i.e. without any training set of known paired sequences. Instead of sequence similarity networks, the IPA takes full sequences as input, and benefits from phylogenetic correlations, as we have shown recently [36, 37]. The IPA reaches on average 84% of TPs, outperforming even the best of the numerous GA runs done for Fig 2.

An interesting observation can be made when comparing the pairings π resulting from multiple runs of our stochastic GA algorithm. As is reported in Fig 3A, about 1300 HK sequences are paired across all runs with exactly the same RR protein. In this robust subset of paired sequences, a very high fraction of 99% of TPs is reached. Thus, almost all FPs appear among non-robust pairs, which differ from one GA run to the other. Fig 3B further shows that, while the number of robust pairings depends on the particular similarity network used, the TP fractions within the robust subset are always very high. Note that, in terms of the size and the quality of the robust subset, kNN networks outperform the orthology network even at moderate k for which the overall accuracy of the orthology networks for GA is still superior.
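Extracting the robust subset from repeated runs amounts to a simple counting step, sketched below. The representation of a pairing as a dictionary from HK index to RR index and the function names are assumptions made for this illustration.

```python
from collections import Counter


def robust_pairs(runs):
    """Return the set of (a, b) pairs that occur in every run.

    runs: list of pairings, each a dict {a_index: b_index}.
    """
    counts = Counter(pair for pairing in runs for pair in pairing.items())
    return {pair for pair, count in counts.items() if count == len(runs)}


def tp_fraction(pairs, truth):
    """Fraction of pairs (a, b) that agree with the known pairing truth[a]."""
    pairs = list(pairs)
    if not pairs:
        return float("nan")  # undefined for an empty set of pairs
    return sum(truth.get(a) == b for a, b in pairs) / len(pairs)
```

The intermediate `counts` object also gives, for each pair, the number of runs in which it appears, which is the quantity histogrammed when assessing robustness.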

Fig 3. Robustness of GA.

(A) Robustness histograms for the 21-NN graph. We perform 100 GA runs and count how many times a HK is paired to the same RR. The horizontal axis gives the number of times a given pair appears, and the vertical axis is the number of pairs appearing that many times across replicates. Black bars are the total number of pairs, and red bars are the number of TPs. The TP ratios are indicated on top of the bars. (B) Number of robust pairs (occurring in all 100 GA runs) obtained by GA for each similarity network. The fraction of correctly matched pairs in this robust subset is indicated in each case.

https://doi.org/10.1371/journal.pcbi.1011010.g003

We also compare the results of our GA method to those of the related MirrorTree family of approaches [16–18], which aim to predict protein-protein interactions by exploiting sequence similarity (and thus phylogeny), see Materials and methods. Indeed, some MirrorTree-based methods have tackled the paralog pairing problem [19–28]. To make a comparison, we implemented such a method (see Materials and methods for details). Table A in S1 Text shows that the performance of GA and of MirrorTree is similar for several datasets, but we find a substantially better TP fraction using GA than MirrorTree for the standard HK-RR dataset with 〈m_p〉 = 11.03. While GA and MirrorTree share conceptual similarities, one key difference is that GA mainly focuses on close neighbor sequences, by restricting to the k nearest neighbors, or to reciprocal closest matches, and by giving exponentially decreasing weights to distant pairs (see Materials and methods). Conversely, MirrorTree relies on all pairs, including many uninformative distant ones, which may become problematic for large datasets and with large numbers of paralogs per species. This may explain the difference observed for the HK-RR dataset. In the MirrorTree approach, one can exploit the stochasticity of the optimization to define robust pairs, exactly as for GA. Table B in S1 Text shows that MirrorTree yields similar numbers of robust pairs as GA but with a lower TP fraction.

To summarize, GA of the sequence-similarity networks of the two MSAs of interest makes it possible to identify a robust subset of paired sequences and to construct a robust partial co-MSA with high accuracy. Furthermore, the accuracy of this robust partial co-MSA is higher when using GA than when using MirrorTree. Next, we employ this partial co-MSA as a starting point for the IPA, and use the IPA to extend it to a full co-MSA.

Using a robust partial co-MSA to seed the coevolution-based iterative pairing algorithm (IPA) yields accurate co-MSAs

The IPA was introduced in [34], and the idea of this method is shown in Fig 1C and detailed in Materials and methods. Briefly, this algorithm starts from a seed co-MSA and employs a coevolution-based DCA model [15, 32, 33] built on this seed co-MSA to score all possible within-species A-B pairs and propose a one-to-one matching of sequences. Proposed pairs with top confidence scores are added to the co-MSA to improve the DCA model and the predicted pairings, and this procedure is iterated, growing the co-MSA by N_increment concatenated sequences at each iteration, until a full co-MSA is obtained. The IPA thus constructs a pairing that approximately maximizes DCA coevolutionary signal.
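The control flow of this iteration can be sketched as follows. This is a schematic reading of the description above, not the published implementation: `fit_model` is a hypothetical stand-in for DCA inference that returns a scoring function (higher meaning more compatible), the matching is found by exhaustive search and is only practical for small species, and the confidence of a pair is taken to be the gap to the best alternative partner, which is an assumption. The `species` argument lists only the sequences that remain to be paired, so seed sequences are excluded from it.

```python
from itertools import permutations


def best_matching(a_ids, b_ids, score):
    """Exhaustive one-to-one matching maximizing the total score."""
    best = max(permutations(b_ids),
               key=lambda perm: sum(score(a, b) for a, b in zip(a_ids, perm)))
    return list(zip(a_ids, best))


def iterative_pairing(species, seed_pairs, fit_model, n_increment):
    """Grow a paired alignment from seed_pairs until all sequences are paired.

    species: list of (a_ids, b_ids) with equally many members per family.
    fit_model(pairs) -> score(a, b): hypothetical model-fitting callable.
    """
    total = sum(len(a_ids) for a_ids, _ in species)
    seed = list(seed_pairs)
    training = list(seed)
    n_added = 0
    while True:
        score = fit_model(training)  # refit the model on the current co-MSA
        ranked = []
        for a_ids, b_ids in species:
            for a, b in best_matching(a_ids, b_ids, score):
                alternatives = [score(a, other) for other in b_ids if other != b]
                # Assumed confidence: gap to the best alternative partner.
                confidence = score(a, b) - max(alternatives, default=0.0)
                ranked.append((confidence, a, b))
        if n_added >= total:
            return [(a, b) for _, a, b in ranked]
        n_added = min(total, n_added + n_increment)
        ranked.sort(key=lambda item: item[0], reverse=True)
        # Keep the seed and add the most confident predicted pairs.
        training = seed + [(a, b) for _, a, b in ranked[:n_added]]
```

In this sketch the predicted pairs are re-selected at every iteration rather than accumulated, so a pair included early can be dropped later if the refitted model no longer ranks it highly.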

The IPA can also be run without a seed co-MSA [34], as is done e.g. in Fig 2. This serves as a baseline comparison for our combined GA-IPA approach. In this case, a random pairing within each species is used to infer the first DCA model. Since this random matching has on average one correct pair per species, we expect some coevolutionary signal to emerge if the average number of paralogs per species 〈m_p〉 = M/N is not too large, but the total dataset depth M is large [40, 41]. The resulting DCA model is used to find the N_increment highest-scoring pairs, which in the second step are used to replace the randomly matched MSA. Iterations are then continued as described in the previous paragraph. The fact that signal adds more constructively than noise, and the fact that two interacting pairs from different species tend to be more similar to one another than to non-interacting pairs, both help to bootstrap IPA toward high performance starting from random pairings (see Fig 2). It has however been shown that the performance of IPA strongly increases when starting from a seed co-MSA, in particular for data sets with large numbers of paralogs or small depth, where starting from random pairings does not yield good performance [34].

This last observation is important in our context. Indeed, using GA without any seed co-MSA, we have constructed a robust partial co-MSA, which we found to be highly accurate. This partial co-MSA from GA can now be used as a seed co-MSA for IPA. We call this new method GA-IPA, since it combines GA and IPA.

Fig 4 shows the performance of GA-IPA for both the orthology graph and the kNN graphs with various k. First, we find that the results of our combined GA-IPA method are substantially better than those of each separate algorithm. GA-IPA even outperforms the already quite accurate IPA results (up to 90% TP fraction in GA-IPA vs. 84% in IPA). Second, results are very robust across different similarity networks. Even for the small value k = 3, when GA alone has a low TP rate and produces a small robust partial co-MSA, we obtain very good results: IPA is able to benefit even from fairly small seed co-MSAs.

Fig 4. GA-IPA outperforms both GA and IPA.

The mean fraction of true-positive pairings (TP fraction) is shown as a function of the number of nearest neighbors k in the kNN graph. We combine GA and IPA in our GA-IPA method: we use the robust pairs obtained by GA as a seed co-MSA for IPA. The results of GA and of IPA without seed co-MSA from Fig 2 are shown for comparison. For IPA, we use N_increment = 6, both without (IPA) and with (GA-IPA) seed co-MSA.

https://doi.org/10.1371/journal.pcbi.1011010.g004

We applied GA-IPA to other datasets corresponding to two other pairs of interacting protein families with different biological functions (ABC transporters and enzymes), as well as to two pairs of protein families with no known interaction but encoded in close proximity in prokaryotic genomes (see Materials and methods). Results are shown in Fig C in S1 Text. We find a very good performance of GA-IPA in all cases, showing its robustness, but the gain of performance compared to IPA alone is very limited. Indeed, the limited numbers of paralogs per species in these datasets make the IPA already very efficient and robust without any seed co-MSA. Note that the successful performance of IPA on pairs of protein families with no known interaction shows that it is already able to exploit phylogenetic signal [36, 37]. With GA, this signal is exploited more explicitly.

As mentioned above, another way of taking into account sequence similarity (and thus phylogeny) is to use MirrorTree approaches. We compare the results of our methods to variants using MirrorTree scores in Table C in S1 Text. We consider two types of such variants (see Materials and methods). First, we use MirrorTree to build the starting robust partial co-MSA (see above and Table B in S1 Text), and we run the DCA-based IPA starting from it. Second, we use MirrorTree scores instead of DCA scores within the IPA, either without a training set, or starting from robust partial co-MSAs from either GA or MirrorTree. We find that these MirrorTree-based variants perform slightly less well than the DCA-based IPA for all datasets, and significantly less well for the HK-RR dataset.

GA-IPA yields accurate co-MSAs for data sets with few sequences or high paralog multiplicities

In Fig 4 and Fig C in S1 Text, the benefit of using GA-IPA instead of the standard IPA without seed co-MSA remains limited. However, these are cases where the IPA already reaches a high accuracy without any seed co-MSA. We now address two hard cases where the IPA without seed co-MSA has poor performance.

The first difficult case we consider involves high average multiplicities 〈m_p〉 = M/N ≫ 1 of paralogs per species, making the pairing task very hard. To investigate this case, we constructed a data set by selecting species with large paralog numbers in our data set of two-component system protein sequences, cf. Materials and methods. The results for 〈m_p〉 = 29.2, i.e. for almost three times more paralogs per species than in the previous dataset, are shown in Fig 5A. In this case, a random matching has only 3.4% TPs. The IPA without seed alignment reaches 16% TP rate, which is better than random, but not sufficient for practical applications. GA alone already performs better in this case: kNN similarity networks with large enough k reach about 40% TP rate. An interesting result is obtained when combining the two: GA-IPA reaches almost the same TP rate (80–90%) as with the data set considered in Fig 4, which however had only an average of 11 paralogs per species. Besides, for this dataset, our approach performs substantially better than MirrorTree-based variants, see Table C in S1 Text. We conclude that GA-IPA is very robust to high paralog multiplicities. This provides a major improvement over previous approaches, and extends the applicability of paralog pairing to highly amplified protein families, cf. the Introduction.

Fig 5. Robust performance of GA-IPA in hard cases of paralog pairing.

(A) Same as Fig 4 but on a dataset having on average 29.2 paralogs per species, compared to 11.03 in Fig 4. While the performance of GA is substantially reduced compared to Fig 4, and that of IPA is even more reduced, GA-IPA achieves much larger TP fractions than GA and IPA. (B) Results of GA, IPA and GA-IPA for smaller datasets obtained by species subsampling from the full HK-RR data set, with 11.1 paralogs per species on average. We observe that GA-IPA needs almost one order of magnitude fewer sequences than IPA to reach comparable TP fractions.

https://doi.org/10.1371/journal.pcbi.1011010.g005

The second difficult case we consider is the case of small MSAs (i.e. MSAs with small M). To analyse this case, we randomly subsampled the species in the full HK-RR data set (see Materials and methods) to obtain smaller data sets, some being as small as M = 100 sequences. As can be seen in Fig 5B, the IPA without seed co-MSA has a strong M dependence, and several thousand sequences are needed in each MSA to reach high TP rates above, e.g., 70–80%. For small M, GA performs a bit better, and its TP rate increases faster, outperforming the IPA by a factor larger than two for M on the order of a few hundred. However, GA yields smaller TP rates for larger M (at constant k for the kNN graph), and GA performs substantially less well than the IPA for M on the order of a few thousand. Indeed, when M increases, more and more edges of the similarity network link distinct species in the two families, making the networks difficult to align. Crucially, we find that the combined GA-IPA always performs best, thus getting the best of both worlds. For very small data sets, GA-IPA does not lead to substantial accuracy gains over GA alone, because DCA needs sufficient data for accurate inference. But as soon as M is about a few hundred, GA-IPA outperforms GA, and it does not suffer from the decay of GA performance at large M, as the extracted robust partial pairings of GA are sufficient to seed IPA, which itself performs better for larger M even without seed co-MSA. To reach TP rates of 70% or 80%, GA-IPA only needs about one thousand sequences, compared to several thousand for the IPA without seed co-MSA.

In Fig D in S1 Text, we perform the same analysis as in Fig 5 with the mutual information (MI)-based IPA introduced in [38] instead of the DCA-based IPA [34]. It shows that our results hold for the MI-based IPA as well as for the DCA-based IPA. In both cases, using the robust partial seed co-MSA from GA yields a substantial increase of performance. Without a training set, the MI-based IPA requires slightly less data in total to achieve good performance than the DCA-based one [38]. For GA-IPA, this difference becomes smaller, but the MI-based IPA still very slightly outperforms the DCA-based one (see Fig D in S1 Text).

Computational time.

GA is usually less computationally costly than IPA (see Fig E in S1 Text). Furthermore, using a training set reduces the number of required IPA iterations compared to starting from no training set. However, in order to get a robust seed co-MSA, we employ multiple runs of GA, which increases the computational cost of the GA step, but is amenable to naive parallelism. Overall, our results show that using GA to provide a training set for the IPA yields important performance gains in difficult cases where there are few sequences or many paralogs per species, while IPA is already very good in other cases. Therefore, with computational cost in mind, we recommend the use of GA primarily in such difficult cases.

Conclusion

In this work, we have shown that the search for interacting paralogs between two protein families strongly benefits from combining two different sources of information, namely phylogeny (via sequence similarity) and residue coevolution, assessed either using DCA or mutual information. This is interesting because these two sources of information are very rarely explicitly combined in computational analyses of protein sequence families. Most phylogenetic analyses work under the assumption of independent-site evolution, i.e. they disregard coevolution between residues. Conversely, most coevolutionary studies effectively treat sequences as close to independent, and employ corrections to reduce the impact of phylogenetic correlations that could obscure functional coevolution.

While unifying these two signals in a single framework remains a hard problem, they have been shown to combine constructively in the inference of protein partners among paralogs by coevolution methods [37], raising the possibility that explicitly combining them may further increase performance. Here, we combine these two signals in a technically straightforward way. By using sequence-similarity networks as a proxy of phylogeny, we can formulate the paralog-pairing problem as a graph-alignment problem, which allows us to identify a subset of high-quality pairings. This partial but robust co-MSA can be used to inform coevolutionary modeling. Indeed, starting from a well-matched seed co-MSA is strongly beneficial to DCA-based paralog pairing.

We find that our two-step strategy (GA-IPA), which combines phylogenetic and coevolutionary information, leads to pairings of higher accuracy than either method alone. Moreover, it is substantially more robust. In particular, it performs well even in situations where coevolution-based paralog pairing alone performs poorly, including shallow MSAs and families with high paralog multiplicities.

In some cases, more information is available, such as a few experimentally known interacting pairs, or genetic co-localization of the genes coding for interacting proteins (e.g. in bacterial operons; this information was used to construct our ground truth). It is easy to incorporate this kind of information into the graph alignment algorithm by modifying the parameters c_mn in Eq 1. Here, these parameters are only used to penalize inter-species pairings, but negative values could be used to favor or even impose pairings of specific sequences for which additional knowledge is available. This can be used to “nucleate” a graph alignment, and could make the robust GA-generated pairing larger and more accurate, thus further improving the performance of GA-IPA.

Here, for simplicity, we considered the case where each species comprises the same number of paralogs from each of the two families. However, in natural data, species include different numbers of paralogs from different families. In [35], this was addressed by using an injective pairing from the family with fewer paralogs into the family with more paralogs. Specifically, during the iterative procedure, in a species with more proteins A than proteins B, each protein B is matched one-to-one to a protein A so that the pairing score is maximized, and one discards the remaining proteins A. This only requires a minor change in our algorithm and could be employed in GA-IPA. A broader challenge, which should be investigated in future work, is to account for the possibility of partially promiscuous interactions: proteins from family A may interact with more than one protein from family B (but not all), and vice versa. This kind of promiscuity is frequently found in eukaryotic signaling systems.
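The injective matching within one species can be illustrated with a minimal sketch. It assumes a hypothetical score table `score[b][a]` (higher is better) and uses brute-force enumeration, which is only practical for a handful of paralogs; a realistic implementation would rely on an assignment solver. The function name and data layout are illustrative and are not taken from [35].

```python
from itertools import permutations

def injective_pairing(score):
    """Match each protein B (row) to a distinct protein A (column).

    score[b][a] is a hypothetical pairing score (higher = better) for one
    species with len(score) proteins B and len(score[0]) >= len(score)
    proteins A. Returns (best_total, assignment), where assignment[b] is the
    index of the protein A paired with protein B number b. Proteins A that
    are left unassigned are simply discarded.
    """
    n_b, n_a = len(score), len(score[0])
    if n_b > n_a:
        raise ValueError("expected at least as many proteins A as proteins B")
    best_total, best_assignment = float("-inf"), None
    # Enumerate every ordered choice of n_b distinct proteins A.
    for choice in permutations(range(n_a), n_b):
        total = sum(score[b][a] for b, a in enumerate(choice))
        if total > best_total:
            best_total, best_assignment = total, list(choice)
    return best_total, best_assignment

# Toy species with 2 proteins B and 3 proteins A (made-up scores).
toy_score = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.7],
]
toy_total, toy_assignment = injective_pairing(toy_score)
```

In this toy case, the first protein B keeps its best partner while the second one takes its second-best choice, because both prefer the same protein A and the total score is what is maximized.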

Materials and methods

Definitions and notations

We start from two MSAs A and B for two interacting protein families A and B (see Fig 1A). A contains M protein sequences indexed by m = 1, …, M, and of aligned length L_A. These sequences belong to N < M distinct species, i.e., there are on average 〈m_p〉 = M/N > 1 paralogs per species, and the number of paralogs can vary across species, cf. Fig A in S1 Text. For simplicity, we assume that B contains the same number M of sequences but L_B can differ from L_A. We further assume that each species has the same number of paralogs of family A and of family B (see Discussion above).

We aim at constructing a bijective matching π: {1, …, M} → {1, …, M}, called paralog pairing, which assigns to each sequence m of family A one putative interaction partner π(m) in family B. We only consider intra-species PPI, which implies that for all m = 1, …, M, the indices m and π(m) belong to the same species.

Data sets

We consider as our primary benchmark a data set composed of 23,632 pairs of natural sequences of interacting histidine kinases (HK) and response regulators (RR) from the P2CS database [42, 43], as previously described in [34, 38]. In this data set, interacting partners are determined using proximity in the genome, derived from the annotations of the P2CS database. This allows us to assess partner inference performance in this natural data set as well as in derived ones. Discarding the 208 pairs from species with only one such pair, for which pairing is trivial, yields a dataset of 23,424 HK-RR pairs with 11.1 paralogs per species on average.

In our first benchmark, we focus on a smaller “standard dataset” extracted from this complete dataset, in view of computational time constraints. The standard dataset was constructed by picking species randomly. It comprises 5064 pairs from 459 species comprising at least two HK-RR pairs, with an average number of pairs per species 〈m_p〉 = 11.03 [34]. To assess the impact of the number of HK-RR pairs per species on the success of GA-IPA, we constructed another dataset where the species with the highest numbers of pairs (25 to 41 in practice) were picked, yielding a dataset with M = 5052 pairs and 〈m_p〉 = 29.20 [34]. In both datasets, the actual paralog numbers vary strongly from species to species, going from the imposed minimum of two paralogs up to a maximum of 42 paralogs, cf. the histograms in Fig A in S1 Text. Finally, to assess the impact of varying M on the performance of inference by GA-IPA, we constructed smaller data sets by picking species randomly from the full data set [34].

We also consider a data set comprising 17,950 pairs of ABC transporter proteins homologous to the Escherichia coli MALG-MALK pair of maltose and maltodextrin transporters [12, 34] and extract a dataset of ∼5000 pairs from it. Similarly, we consider the homologs of interacting E. coli enzymes XDHA-XDHC, and retain the full dataset of ∼2000 pairs in this case. In these data sets, interacting partners are determined using proximity in the genome (as for HK-RR), following the approach from Ref. [12].

Since the approach presented here explicitly relies on phylogeny, it is interesting to also test it on proteins that share phylogeny without being interacting partners or having common functional constraints. While it is difficult to be certain that two protein families do not have common functional constraints, we picked two pairs of families that are encoded in close proximity on prokaryotic genomes but do not have known physical interactions [44]. They are the E. coli protein pairs LOLC-MACA and ACRE-ENVR and their homologs [36]. (Note that ENVR has regulatory roles on ACRE expression [45].) The datasets we employed for these pairs include ∼2000 homologous pairs.

Constructing sequence-similarity networks

Let us present the construction of sequence-similarity networks for one MSA (note that these networks are constructed in an equivalent way for each of the two MSAs that we want to pair).

A first step for the construction of sequence-similarity networks is the choice of a distance (or dissimilarity) measure d_mn for any pair (m, n) ∈ {1, …, M}² of homologous sequences. Here we choose the Hamming distance, which simply counts the amino-acid mismatches between the two aligned sequences (see discussion in Results and discussion).
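As a minimal sketch of this first step, the pairwise Hamming distances of a small aligned set of sequences can be computed as follows. The function names and the toy alignment are hypothetical, and treating the gap symbol as an ordinary character is an assumption of this sketch.

```python
def hamming_distance(seq_a, seq_b):
    """Number of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    return sum(x != y for x, y in zip(seq_a, seq_b))

def distance_matrix(msa):
    """Symmetric matrix d[m][n] of Hamming distances between all sequences."""
    size = len(msa)
    d = [[0] * size for _ in range(size)]
    for m in range(size):
        for n in range(m + 1, size):
            d[m][n] = d[n][m] = hamming_distance(msa[m], msa[n])
    return d

# Toy alignment of three sequences of aligned length 5 (made-up data).
toy_msa = ["ACD-E", "ACDFE", "GCDFE"]
toy_d = distance_matrix(toy_msa)
```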

Equipped with this distance measure between aligned sequences, we construct for each MSA a sequence-similarity network G = (V, E, w), defined as a weighted undirected graph with vertices (or nodes) V = {1, …, M}, edges E and positive edge weights w_mn. As mentioned above, we consider two types of similarity networks, where the edges are extracted using the following two distinct procedures:

kNN network: In this network, each node is connected to its k nearest neighbors, possibly including links inside one species (connecting closely related paralogs), or multiple links between species. The kNN network is a priori directed (i.e. if n is a k-nearest neighbor of m, then m is not necessarily a k-nearest neighbor of n). Here, we disregard the directionality of edges, and retain an edge if it is present in at least one direction. We further merge possible double edges resulting from reciprocal choices. Therefore, nodes have degrees greater than or equal to k, cf. Fig B in S1 Text for an example degree distribution. The parameter k is systematically varied in our analyses.
Orthology network: As a simple operational definition of orthology we use reciprocal best hits. For each protein in the MSA, we select its closest neighbor in each of the other species. We include an edge between two sequences if and only if this selection is reciprocal. Note that this construction can lead to very high degrees, cf. Fig B in S1 Text. For instance, in a species having a single sequence, this sequence is connected to all N − 1 other species. Contrary to kNN networks, orthology networks do not contain any links inside species, even when close paralogy is present.
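The two edge-extraction procedures can be sketched as follows, starting from a precomputed distance matrix `d` and a list `species` giving the species label of each sequence. Function names, the toy data and the tie-breaking rule (lowest index wins) are assumptions of this sketch, not details taken from the published code.

```python
def knn_edges(d, k):
    """Undirected kNN edges: keep (m, n) if n is among the k nearest
    neighbors of m or vice versa; reciprocal choices give a single edge."""
    size = len(d)
    edges = set()
    for m in range(size):
        others = sorted((n for n in range(size) if n != m),
                        key=lambda n: (d[m][n], n))
        for n in others[:k]:
            edges.add((min(m, n), max(m, n)))
    return edges

def orthology_edges(d, species):
    """Reciprocal best hits between sequences of different species."""
    size = len(d)
    best = {}  # (m, s) -> closest sequence to m within species s
    for m in range(size):
        for n in range(size):
            s = species[n]
            if s == species[m]:
                continue  # no links inside a species
            if (m, s) not in best or d[m][n] < d[m][best[(m, s)]]:
                best[(m, s)] = n
    edges = set()
    for (m, s), n in best.items():
        # Keep the edge only if m is also the best hit of n in m's species.
        if best.get((n, species[m])) == m:
            edges.add((min(m, n), max(m, n)))
    return edges

# Toy example: two species with two paralogs each (made-up distances).
toy_species = ["X", "X", "Y", "Y"]
toy_d = [
    [0, 4, 1, 5],
    [4, 0, 5, 2],
    [1, 5, 0, 4],
    [5, 2, 4, 0],
]
toy_knn = knn_edges(toy_d, k=2)
toy_orth = orthology_edges(toy_d, toy_species)
```

With k = 2, the toy kNN network contains within-species links (0, 1) and (2, 3), whereas the orthology network only retains the two reciprocal best hits across species.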

To complete the construction of G, we define each edge weight w_mn as a function of d_mn and D, where d_mn is the distance between m and n, while D is the average distance (over all nodes) of the kth nearest neighbor in the case of the kNN network, and the average distance of the most distant ortholog in the case of the orthology network. The weight w_mn gives stronger importance (i.e. larger weight) to edges between similar sequences.
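The role of the scale D can be illustrated for the kNN network with the sketch below. The explicit weight formula is not reproduced in this text, so the kernel exp(−d/D) used here is a placeholder assumption, chosen only because it decreases with distance; function names and toy data are likewise hypothetical.

```python
import math

def kth_neighbor_scale(d, k):
    """Average, over all nodes, of the distance to the k-th nearest neighbor."""
    size = len(d)
    total = 0.0
    for m in range(size):
        dists = sorted(d[m][n] for n in range(size) if n != m)
        total += dists[k - 1]
    return total / size

def edge_weights(d, edges, scale, kernel=lambda x: math.exp(-x)):
    """Weight of each edge as kernel(d_mn / scale).

    The default kernel is an assumed placeholder, not the published formula.
    """
    return {(m, n): kernel(d[m][n] / scale) for (m, n) in edges}

# Toy distance matrix and kNN edges for k = 1 (made-up data).
toy_d = [
    [0, 4, 1, 5],
    [4, 0, 5, 2],
    [1, 5, 0, 4],
    [5, 2, 4, 0],
]
toy_edges = {(0, 2), (1, 3)}
toy_scale = kth_neighbor_scale(toy_d, k=1)
toy_w = edge_weights(toy_d, toy_edges, toy_scale)
```

Any decreasing kernel would preserve the qualitative property stated above, namely that edges between more similar sequences receive larger weights.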

Aligning two sequence-similarity networks (GA)

We construct a network for each of the two families A and B, leading to two networks G^A = (V, E^A, w^A) and G^B = (V, E^B, w^B). Because of the similarities in the phylogenies of interacting proteins, if (m, π(m)) and (n, π(n)) are two interacting protein pairs, and the sequences m and n of family A are close homologs (i.e. small d^A_mn), then the sequences π(m) and π(n) of family B are also expected to be similar (i.e. small d^B_π(m)π(n)).

Thus, the search for a paralog pairing π can be mapped to a graph-alignment problem, where the vertices of the two graphs are paired to maximize the number of overlapping edges. To this end, we define a cost function (Eq 1) which, for any given pairing π: V → V, determines the negative of the total weight of all overlapping edges, plus a penalty term. This last term is defined via the parameters c_mn (Eq 2): an infinite penalty is introduced for any pairing between proteins from different species. This term guarantees that only proteins from the same species are paired.
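For concreteness, one way to write a cost function matching this description is given below. It is a sketch under the assumption that the overlap of two edges is measured by the product of their weights, with w = 0 for absent edges; this product form and the symbol C are assumptions of the sketch, whereas the structure of the penalty term follows directly from the text.

```latex
\begin{align}
C(\pi) &= -\sum_{m<n} w^{A}_{mn}\, w^{B}_{\pi(m)\pi(n)}
          \;+\; \sum_{m=1}^{M} c_{m\,\pi(m)} , \\
c_{mn} &=
\begin{cases}
0 & \text{if } m \text{ and } n \text{ belong to the same species},\\
+\infty & \text{otherwise}.
\end{cases}
\end{align}
```

With this form, an edge (m, n) of G^A contributes to the first sum only if its image (π(m), π(n)) is also an edge of G^B, and any finite-cost pairing is automatically intra-species.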

Finding the paralog pairing π with minimal cost is highly non-trivial. Here we employ a heuristic stochastic local search algorithm based on simulated annealing.

Approximately aligning two sequence-similarity networks by simulated annealing

Simulated annealing [46] is a heuristic optimization method aiming to find the state of a system that minimizes a cost function. In this approach, the amount of noise is gradually decreased via the temperature T, according to a predefined cooling protocol, until T is close to zero (corresponding to very little noise). This procedure reduces the risk of getting stuck at local minima of the cost function. At each temperature, Markov Chain Monte Carlo (MCMC) updates are run until the system is in thermal equilibrium. Here we use the following cooling protocol, known as the exponential schedule [47]: T = T₀ α^t (3), where T is the temperature at simulation step t, while T₀ is the initial temperature and α is a coefficient satisfying 0 < α < 1.
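Because the temperature is multiplied by α at every step, the temperature after t steps is T₀αᵗ, and the number of cooling steps needed to reach a target temperature T_min follows directly. The numerical values used below (T₀ = 1, α = 0.9, T_min = 0.01) are hypothetical and only illustrate the arithmetic.

```latex
T_0\,\alpha^{t} \le T_{\min}
\;\Longleftrightarrow\;
t \ge \frac{\ln(T_{\min}/T_0)}{\ln \alpha}
  = \frac{\ln(0.01)}{\ln(0.9)} \approx 43.7 ,
\qquad \text{so } t = 44 \text{ steps suffice.}
```

The inequality flips when dividing by ln α because ln α < 0 for 0 < α < 1.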

Pseudocode for the simulated annealing procedure is given below.

Algorithm: Simulated annealing

0. Initialize the matching π by randomly pairing each sequence A with a sequence B from the same species.
Initialize temperature and number of steps: T = T₀ and t = 0.
1. Propose MCMC updates at temperature T, until a fixed number are accepted.
Each proposed update proceeds as follows:
* Randomly select one pair of sequences from the full dataset.
* Randomly select a second pair of sequences from the same species.
* Propose to exchange their interaction partners, yielding a new matching π′.
* Update the matching to π′ with a probability that depends on the resulting change in the cost function and on the temperature T.
2. Multiply the temperature T by a factor α, and increase t by one.
3. Repeat steps 1 and 2 until T reaches a target small value.
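The procedure above can be sketched in Python as follows. This is a minimal illustration, not the published implementation: it assumes that the overlap of an edge pair is the product of the two weights, uses a Metropolis acceptance rule, and runs a fixed number of proposals per temperature (rather than a fixed number of accepted moves) so that the loop always terminates. The function names, the default parameters and the convention that sequence m of family A and sequence m of family B carry the same species label are hypothetical choices.

```python
import math
import random

def overlap_cost(pi, w_a, w_b):
    """Negative total overlap; w_a and w_b are dicts {(m, n): weight}, m < n.
    Assumed form: product of the weights of overlapping edges."""
    cost = 0.0
    for (m, n), w in w_a.items():
        p, q = pi[m], pi[n]
        cost -= w * w_b.get((min(p, q), max(p, q)), 0.0)
    return cost

def anneal(w_a, w_b, species, t0=1.0, alpha=0.9, t_min=1e-3,
           proposals=200, seed=0):
    """Simulated annealing over within-species pairings pi (A index -> B index)."""
    rng = random.Random(seed)
    members = {}
    for idx, s in enumerate(species):
        members.setdefault(s, []).append(idx)
    # Step 0: random pairing inside each species.
    pi = list(range(len(species)))
    for idx_list in members.values():
        shuffled = idx_list[:]
        rng.shuffle(shuffled)
        for m, p in zip(idx_list, shuffled):
            pi[m] = p
    cost = overlap_cost(pi, w_a, w_b)
    temperature = t0
    while temperature > t_min:
        for _ in range(proposals):
            # Pick a pair, then a second pair from the same species.
            m = rng.randrange(len(species))
            n = rng.choice(members[species[m]])
            if n == m:
                continue
            pi[m], pi[n] = pi[n], pi[m]  # propose exchanging partners
            delta = overlap_cost(pi, w_a, w_b) - cost
            # Metropolis rule (assumed): accept downhill moves, and uphill
            # moves with probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost += delta
            else:
                pi[m], pi[n] = pi[n], pi[m]  # reject: undo the exchange
        temperature *= alpha  # exponential cooling
    return pi, overlap_cost(pi, w_a, w_b)

# Toy example: two species with two paralogs each (made-up weights).
toy_species = ["X", "X", "Y", "Y"]
toy_w_a = {(0, 2): 1.0, (1, 3): 0.5}
toy_w_b = {(0, 2): 1.0, (1, 3): 0.5}
toy_pi, toy_cost = anneal(toy_w_a, toy_w_b, toy_species)
```

Since exchanges are only proposed inside a species, the infinite inter-species penalty never needs to be evaluated. Recomputing the full cost at each proposal keeps the sketch short; an efficient implementation would only update the terms affected by the exchange.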

Coevolution-based iterative pairing algorithm (IPA)

The IPA, introduced in [34], starts from a seed co-MSA (“gold-standard set”), from which a pairwise maximum entropy model, also known as a DCA model [15, 32, 33] or as a Potts model, is inferred (see Fig 1C). Because this DCA model is built from a co-MSA, it is able to attribute a statistical energy score to any concatenated sequence composed of one sequence of family A and one of family B. This model is used to score all possible within-species A-B pairs that are not contained in the gold-standard set. These scores are used to perform a one-to-one matching of sequences within each species. The N_increment pairs with the top confidence score (based on an energy gap [34]) among these proposed pairs are then added to the gold-standard set to form an extended co-MSA. This extended co-MSA is then employed to infer a new DCA model, which is in turn used to re-score all pairs of sequences not belonging to the gold-standard set. The procedure is iterated, adding the nN_increment top-scoring pairs to the gold-standard set at iteration n, until the co-MSA is complete. This procedure heuristically constructs a pairing having high inter-protein DCA coevolutionary signal.

The IPA can also be run without a seed co-MSA. In this case, a random pairing within each species is used to infer the first DCA model. Unlike in the case with a seed co-MSA, where the seed co-MSA is not scored and is left untouched, here all pairs are scored and re-paired at each iteration.

In [38], a variant of the DCA-based IPA, based on pointwise mutual information, the MI-IPA, was introduced. It relies on the same iterative procedure, but uses scores based on mutual information instead of inferring DCA models. The MI-IPA reaches slightly higher performance than the DCA-IPA on natural datasets, and is more robust to smaller datasets [38].

Matlab implementations of the DCA-based IPA and the MI-based IPA on our standard HK-RR dataset are freely available at https://doi.org/10.5281/zenodo.1421861 and https://doi.org/10.5281/zenodo.1421781, respectively. Julia implementations of the GA-IPA (as well as of the IPA), both DCA-based and MI-based, are freely available at https://doi.org/10.5281/zenodo.7731108.

Comparison with MirrorTree-based methods

Several methods based on sequence similarity (and thus phylogeny) have been developed to predict protein-protein interactions. A prominent family of such methods, known as MirrorTree [16–18], assesses how similar the pairwise distance matrices between sequences are across two protein families to determine whether they interact. MirrorTree approaches differ in how they compute these distance matrices. The basic methodology uses the neighbor-joining algorithm implemented in ClustalW [48] to generate a phylogenetic tree for each protein family, which is then used to compute pairwise distance matrices between all orthologs by summing the lengths of the branches separating the corresponding orthologs [16, 17, 19–25]. In [23], distances were instead computed with protdist from the PHYLIP package [49] rather than with ClustalW, which corrects for multiple substitutions per site and thus allows distances larger than 1. Also, [50] incorporates information on the overall evolutionary histories of the species to correct the distances between orthologs for speciation events. While these methods are often restricted to orthologs, some variants tackle the paralog pairing problem [19–28]. In several of these methods [19–21, 23, 24], the order of the sequences in one of the two families is modified to make the two distance matrices more similar. Similarity can be assessed e.g. via the Pearson correlation between the entries of the two distance matrices [21]. In this framework, consider the distance matrix of family A (resp. family B), whose elements d^A_mn (resp. d^B_mn) denote the distance between sequences m and n of family A (resp. family B). The similarity score between the two matrices is then the Pearson correlation between the ordered list of all elements of the first matrix on the one hand and the ordered list of all elements of the second matrix on the other hand [21]. We consider this MirrorTree-based approach to the paralog matching problem and compare it to our graph alignment (GA) method in Table A in S1 Text. Optimization is performed via a Monte Carlo algorithm. One can exploit the stochasticity of this optimization by considering pairs that are robustly predicted over many replicates, exactly as for GA, see Table B in S1 Text.
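The selection of robustly predicted pairs, used for both GA and the MirrorTree-based optimization, can be sketched as follows. Each replicate is represented as a list `pi` in which `pi[m]` is the partner assigned to sequence m; the function name and the `min_fraction` threshold are hypothetical, and requiring a pair to appear in every replicate (the default) is an assumption of this sketch.

```python
from collections import Counter

def robust_pairs(pairings, min_fraction=1.0):
    """Return the pairs (m, pi[m]) found in at least min_fraction of the runs."""
    counts = Counter()
    for pi in pairings:
        counts.update(enumerate(pi))  # count each (m, partner) pair once per run
    threshold = min_fraction * len(pairings)
    return sorted(pair for pair, c in counts.items() if c >= threshold)

# Three toy replicates over four sequences (made-up pairings).
toy_runs = [
    [0, 1, 2, 3],
    [0, 1, 3, 2],
    [0, 1, 2, 3],
]
toy_strict = robust_pairs(toy_runs)        # pairs present in all runs
toy_loose = robust_pairs(toy_runs, 0.6)    # pairs present in at least 2 of 3 runs
```

The resulting partial pairing is what can then serve as a seed co-MSA: stricter thresholds yield fewer but more reliable pairs.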

Besides, as we proposed in [36], one can also use a variant of MirrorTree to address the paralog pairing problem in the IPA spirit. For this, we need to attribute a score to each possible within-species A-B pair, using a training co-MSA, which is the gold-standard set at the first IPA step, or the extended co-MSA enriched with top predictions at subsequent steps (see above). In the MirrorTree approach, we use a Pearson correlation between Hamming distances observed in the two families to define this score. Specifically, consider the training co-MSA, which contains P pairs A-B. For each chain A of the testing set, we compute the vector of its P Hamming distances to the chains A of the training set. We also compute the analogous vector for each chain B of the testing set. Next, we define the pairing score of a possible pair A-B as the Pearson correlation between the distance vector of its chain A and that of its chain B. This score can be used for predicting partnerships in the IPA exactly as the DCA- and MI-based scores. In particular, to work fully in the MirrorTree spirit while using similar algorithms as in the rest of our work, we can start the MirrorTree-based IPA from a robust set of pairs predicted by MirrorTree (see paragraph above), yielding the results in Table C in S1 Text.
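The pairing score described above can be sketched as follows. The function names and toy sequences are hypothetical, and returning 0 when one of the distance vectors is constant (so that the correlation is undefined) is a convention chosen for this sketch.

```python
import math

def hamming(x, y):
    """Number of mismatches between two aligned sequences."""
    return sum(a != b for a, b in zip(x, y))

def pearson(u, v):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined correlation; convention of this sketch
    return cov / (norm_u * norm_v)

def mirrortree_score(seq_a, seq_b, training_pairs):
    """Correlation between the distances of seq_a to the training chains A
    and the distances of seq_b to the training chains B."""
    dist_a = [hamming(seq_a, train_a) for train_a, _ in training_pairs]
    dist_b = [hamming(seq_b, train_b) for _, train_b in training_pairs]
    return pearson(dist_a, dist_b)

# Toy training co-MSA with P = 3 pairs, and two candidate partners (made up).
toy_training = [("AAAA", "CCCC"), ("AAAT", "CCCG"), ("TTTT", "GGGG")]
toy_good = mirrortree_score("AATT", "CCGG", toy_training)
toy_bad = mirrortree_score("AATT", "GGGG", toy_training)
```

In the toy example, the candidate whose distances to the training set mirror those of the chain A receives the higher score.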

Supporting information

S1 Text. All supporting material is collected in a single supplementary text file.

https://doi.org/10.1371/journal.pcbi.1011010.s001

(PDF)

References

1. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. pmid:34265844
2. Humphreys IR, Pei J, Baek M, Krishnakumar A, Anishchenko I, Ovchinnikov S, et al. Computed structures of core eukaryotic protein complexes. Science. 2021;374:1340. pmid:34762488
3. Bryant P, Pozzati G, Elofsson A. Improved prediction of protein-protein interactions using AlphaFold2. Nat Commun. 2022;13(1):1265. pmid:35273146
4. Evans R, O’Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold-Multimer. BioRxiv Preprint; p.
5. Szurmant H, Weigt M. Inter-residue, inter-protein and inter-family coevolution: bridging the scales. Current Opinion in Structural Biology. 2018;50:26–32. pmid:29101847
6. Hoch JA, Varughese K. Keeping signals straight in phosphorelay signal transduction. Journal of bacteriology. 2001;183(17):4941–4949. pmid:11489844
7. Laub MT, Goulian M. Specificity in two-component signal transduction pathways. Annu Rev Genet. 2007;41:121–145. pmid:18076326
8. Tang RJ, Wang C, Li K, Luan S. The CBL–CIPK calcium signaling network: unified paradigm from 20 years of discoveries. Trends in Plant Science. 2020;25(6):604–617. pmid:32407699
9. Zhang X, Li X, Zhao R, Zhou Y, Jiao Y. Evolutionary strategies drive a balance of the interacting gene products for the CBL and CIPK gene families. New Phytologist. 2020;226(5):1506–1516. pmid:31967665
10. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer EL, et al. Pfam: The protein families database in 2021. Nucleic acids research. 2021;49(D1):D412–D419. pmid:33125078
11. Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, Salazar GA, et al. InterPro in 2022. Nucleic Acids Research. 2022;.
12. Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife. 2014;3:e02030. pmid:24842992
13. Cong Q, Anishchenko I, Ovchinnikov S, Baker D. Protein interaction networks revealed by proteome coevolution. Science. 2019;365(6449):185–189. pmid:31296772
14. Green AG, Elhabashy H, Brock KP, Maddamsetti R, Kohlbacher O, Marks DS. Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. Nat Commun. 2021;12(1):1396. pmid:33654096
15. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein-protein interaction by message passing. Proc Natl Acad Sci USA. 2009;106(1):67–72. pmid:19116270
16. Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng Des Sel. 2001;14(9):609–614. pmid:11707606
17. Ochoa D, Pazos F. Studying the co-evolution of protein families with the Mirrortree web server. Bioinformatics. 2010;26(10):1370–1371. http://csbg.cnb.csic.es/mtserver. pmid:20363731
18. Ochoa D, Juan D, Valencia A, Pazos F. Detection of significant protein coevolution. Bioinformatics. 2015;31(13):2166–2173. http://csbg.cnb.csic.es/pMT/. pmid:25717190
19. Goh CS, Cohen FE. Co-evolutionary analysis reveals insights into protein–protein interactions. Journal of molecular biology. 2002;324(1):177–192. pmid:12421567
20. Ramani AK, Marcotte EM. Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol. 2003;327(1):273–284. pmid:12614624
21. Gertz J, Elfond G, Shustrova A, Weisinger M, Pellegrini M, Cokus S, et al. Inferring protein interactions from phylogenetic distance matrices. Bioinformatics. 2003;19(16):2039–2045. pmid:14594708
22. Izarzugaza JM, Juan D, Pons C, Ranea JA, Valencia A, Pazos F. TSEMA: interactive prediction of protein pairings between interacting families. Nucleic Acids Res. 2006;34(Web Server issue):W315–319. pmid:16845017
23. Tillier ER, Biro L, Li G, Tillo D. Codep: maximizing co-evolutionary interdependencies to discover interacting proteins. Proteins: Structure, Function, and Bioinformatics. 2006;63(4):822–831. pmid:16634043
24. Izarzugaza JM, Juan D, Pons C, Pazos F, Valencia A. Enhancing the prediction of protein pairings between interacting families using orthology information. BMC Bioinformatics. 2008;9:35. pmid:18215279
25. Tillier ER, Charlebois RL. The human protein coevolution network. Genome Res. 2009;19(10):1861–1871. pmid:19696150
26. Bradde S, Braunstein A, Mahmoudi H, Tria F, Weigt M, Zecchina R. Aligning graphs and finding substructures by a cavity approach. EPL. 2010;89(3).
27. Hajirasouliha I, Schönhuth A, de Juan D, Valencia A, Sahinalp SC. Mirroring co-evolving trees in the light of their topologies. Bioinformatics. 2012;28(9):1202–1208. pmid:22399677
28. El-Kebir M, Marschall T, Wohlers I, Patterson M, Heringa J, Schonhuth A, et al. Mapping proteins in the presence of paralogs using units of coevolution. BMC Bioinformatics. 2013;14 Suppl 15:S18. pmid:24564758
29. De Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nature Reviews Genetics. 2013;14(4):249–261. pmid:23458856
30. Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M. Inverse statistical physics of protein sequences: a key issues review. Rep Prog Phys. 2018;81(3):032601. pmid:29120346
31. Burger L, van Nimwegen E. Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol. 2008;4:165. pmid:18277381
32. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE. 2011;6(12):e28766. pmid:22163331
33. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA. 2011;108(49):E1293–1301. pmid:22106262
34. Bitbol AF, Dwyer RS, Colwell LJ, Wingreen NS. Inferring interaction partners from protein sequences. Proc Natl Acad Sci USA. 2016;113(43):12180–12185. pmid:27663738
35. Gueudre T, Baldassi C, Zamparo M, Weigt M, Pagnani A. Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis. Proc Natl Acad Sci USA. 2016;113(43):12186–12191. pmid:27729520
36. Marmier G, Weigt M, Bitbol AF. Phylogenetic correlations can suffice to infer protein partners from sequences. PLoS Comput Biol. 2019;15(10):e1007179. pmid:31609984
37. Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol. 2022;18(5):e1010147. pmid:35576238
38. Bitbol AF. Inferring interaction partners from protein sequences using mutual information. PLoS Comput Biol. 2018;14(11):e1006401. pmid:30422978
39. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755–763. pmid:9918945
40. Malinverni D, Jost Lopez A, De Los Rios P, Hummer G, Barducci A. Modeling Hsp70/Hsp40 interaction by multi-scale molecular simulations and coevolutionary sequence analysis. Elife. 2017;6. pmid:28498104
41. Gandarilla-Pérez CA, Mergny P, Weigt M, Bitbol AF. Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences. Phys Rev E. 2020;101:032413. pmid:32290011
42. Barakat M, Ortet P, Jourlin-Castelli C, Ansaldi M, Mejean V, Whitworth DE. P2CS: a two-component system resource for prokaryotic signal transduction research. BMC Genomics. 2009;10:315. pmid:19604365
43. Barakat M, Ortet P, Whitworth DE. P2CS: a database of prokaryotic two-component systems. Nucleic Acids Res. 2011;39(Database issue):D771–776. pmid:21051349
44. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47(D1):D607–D613. pmid:30476243
45. Hirakawa H, Takumi-Kobayashi A, Theisen U, Hirata T, Nishino K, Yamaguchi A. AcrS/EnvR represses expression of the acrAB multidrug efflux genes in Escherichia coli. J Bacteriol. 2008;190(18):6276–6279. pmid:18567659
46. Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science. 1983;220(4598):671–680. pmid:17813860
47. Hartmann AK, Weigt M. Phase transitions in combinatorial optimization problems: basics, algorithms and statistical mechanics. John Wiley and Sons; 2006.
48. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. pmid:7984417
49. Felsenstein J. PHYLIP (Phylogeny Inference Package). Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.
50. Pazos F, Ranea JA, Juan D, Sternberg MJ. Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. Journal of Molecular Biology. 2005;352(4):1002–1015. pmid:16139301

[ref1] 1. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. pmid:34265844
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Humphreys IR, Pei J, Baek M, Krishnakumar A, Anishchenko I, Ovchinnikov S, et al. Computed structures of core eukaryotic protein complexes. Science. 2021;374:1340. pmid:34762488
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Bryant P, Pozzati G, Elofsson A. Improved prediction of protein-protein interactions using AlphaFold2. Nat Commun. 2022;13(1):1265. pmid:35273146
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Evans R, O’Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold-Multimer. BioRxiv Preprint; p.

[ref5] 5. Szurmant H, Weigt M. Inter-residue, inter-protein and inter-family coevolution: bridging the scales. Current Opinion in Structural Biology. 2018;50:26–32. pmid:29101847
View Article
PubMed/NCBI
Google Scholar

[15] View Article

[16] PubMed/NCBI

[17] Google Scholar

[ref6] 6. Hoch JA, Varughese K. Keeping signals straight in phosphorelay signal transduction. Journal of bacteriology. 2001;183(17):4941–4949. pmid:11489844
View Article
PubMed/NCBI
Google Scholar

[19] View Article

[20] PubMed/NCBI

[21] Google Scholar

[ref7] 7. Laub MT, Goulian M. Specificity in two-component signal transduction pathways. Annu Rev Genet. 2007;41:121–145. pmid:18076326
View Article
PubMed/NCBI
Google Scholar

[23] View Article

[24] PubMed/NCBI

[25] Google Scholar

[ref8] 8. Tang RJ, Wang C, Li K, Luan S. The CBL–CIPK calcium signaling network: unified paradigm from 20 years of discoveries. Trends in Plant Science. 2020;25(6):604–617. pmid:32407699
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref9] 9. Zhang X, Li X, Zhao R, Zhou Y, Jiao Y. Evolutionary strategies drive a balance of the interacting gene products for the CBL and CIPK gene families. new phytologist. 2020;226(5):1506–1516. pmid:31967665
View Article
PubMed/NCBI
Google Scholar

[31] View Article

[32] PubMed/NCBI

[33] Google Scholar

[ref10] 10. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer EL, et al. Pfam: The protein families database in 2021. Nucleic acids research. 2021;49(D1):D412–D419. pmid:33125078
View Article
PubMed/NCBI
Google Scholar

[35] View Article

[36] PubMed/NCBI

[37] Google Scholar

[ref11] 11. Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, Salazar GA, et al. InterPro in 2022. Nucleic Acids Research. 2022;.
View Article
Google Scholar

[39] View Article

[40] Google Scholar

[ref12] 12. Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife. 2014;3:e02030. pmid:24842992

[ref13] 13. Cong Q, Anishchenko I, Ovchinnikov S, Baker D. Protein interaction networks revealed by proteome coevolution. Science. 2019;365(6449):185–189. pmid:31296772

[ref14] 14. Green AG, Elhabashy H, Brock KP, Maddamsetti R, Kohlbacher O, Marks DS. Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. Nat Commun. 2021;12(1):1396. pmid:33654096

[ref15] 15. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein-protein interaction by message passing. Proc Natl Acad Sci USA. 2009;106(1):67–72. pmid:19116270

[ref16] 16. Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng Des Sel. 2001;14(9):609–614. pmid:11707606

[ref17] 17. Ochoa D, Pazos F. Studying the co-evolution of protein families with the Mirrortree web server. Bioinformatics. 2010;26(10):1370–1371. http://csbg.cnb.csic.es/mtserver. pmid:20363731

[ref18] 18. Ochoa D, Juan D, Valencia A, Pazos F. Detection of significant protein coevolution. Bioinformatics. 2015;31(13):2166–2173. http://csbg.cnb.csic.es/pMT/. pmid:25717190

[ref19] 19. Goh CS, Cohen FE. Co-evolutionary analysis reveals insights into protein–protein interactions. Journal of Molecular Biology. 2002;324(1):177–192. pmid:12421567

[ref20] 20. Ramani AK, Marcotte EM. Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol. 2003;327(1):273–284. pmid:12614624

[ref21] 21. Gertz J, Elfond G, Shustrova A, Weisinger M, Pellegrini M, Cokus S, et al. Inferring protein interactions from phylogenetic distance matrices. Bioinformatics. 2003;19(16):2039–2045. pmid:14594708

[ref22] 22. Izarzugaza JM, Juan D, Pons C, Ranea JA, Valencia A, Pazos F. TSEMA: interactive prediction of protein pairings between interacting families. Nucleic Acids Res. 2006;34(Web Server issue):W315–319. pmid:16845017

[ref23] 23. Tillier ER, Biro L, Li G, Tillo D. Codep: maximizing co-evolutionary interdependencies to discover interacting proteins. Proteins: Structure, Function, and Bioinformatics. 2006;63(4):822–831. pmid:16634043

[ref24] 24. Izarzugaza JM, Juan D, Pons C, Pazos F, Valencia A. Enhancing the prediction of protein pairings between interacting families using orthology information. BMC Bioinformatics. 2008;9:35. pmid:18215279

[ref25] 25. Tillier ER, Charlebois RL. The human protein coevolution network. Genome Res. 2009;19(10):1861–1871. pmid:19696150

[ref26] 26. Bradde S, Braunstein A, Mahmoudi H, Tria F, Weigt M, Zecchina R. Aligning graphs and finding substructures by a cavity approach. EPL. 2010;89(3).

[ref27] 27. Hajirasouliha I, Schönhuth A, de Juan D, Valencia A, Sahinalp SC. Mirroring co-evolving trees in the light of their topologies. Bioinformatics. 2012;28(9):1202–1208. pmid:22399677

[ref28] 28. El-Kebir M, Marschall T, Wohlers I, Patterson M, Heringa J, Schonhuth A, et al. Mapping proteins in the presence of paralogs using units of coevolution. BMC Bioinformatics. 2013;14 Suppl 15:S18. pmid:24564758

[ref29] 29. De Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nature Reviews Genetics. 2013;14(4):249–261. pmid:23458856

[ref30] 30. Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M. Inverse statistical physics of protein sequences: a key issues review. Rep Prog Phys. 2018;81(3):032601. pmid:29120346

[ref31] 31. Burger L, van Nimwegen E. Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol. 2008;4:165. pmid:18277381

[ref32] 32. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE. 2011;6(12):e28766. pmid:22163331

[ref33] 33. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA. 2011;108(49):E1293–1301. pmid:22106262

[ref34] 34. Bitbol AF, Dwyer RS, Colwell LJ, Wingreen NS. Inferring interaction partners from protein sequences. Proc Natl Acad Sci USA. 2016;113(43):12180–12185. pmid:27663738

[ref35] 35. Gueudre T, Baldassi C, Zamparo M, Weigt M, Pagnani A. Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis. Proc Natl Acad Sci USA. 2016;113(43):12186–12191. pmid:27729520

[ref36] 36. Marmier G, Weigt M, Bitbol AF. Phylogenetic correlations can suffice to infer protein partners from sequences. PLoS Comput Biol. 2019;15(10):e1007179. pmid:31609984

[ref37] 37. Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol. 2022;18(5):e1010147. pmid:35576238

[ref38] 38. Bitbol AF. Inferring interaction partners from protein sequences using mutual information. PLoS Comput Biol. 2018;14(11):e1006401. pmid:30422978

[ref39] 39. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755–763. pmid:9918945

[ref40] 40. Malinverni D, Jost Lopez A, De Los Rios P, Hummer G, Barducci A. Modeling Hsp70/Hsp40 interaction by multi-scale molecular simulations and coevolutionary sequence analysis. eLife. 2017;6. pmid:28498104

[ref41] 41. Gandarilla-Pérez CA, Mergny P, Weigt M, Bitbol AF. Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences. Phys Rev E. 2020;101:032413. pmid:32290011

[ref42] 42. Barakat M, Ortet P, Jourlin-Castelli C, Ansaldi M, Mejean V, Whitworth DE. P2CS: a two-component system resource for prokaryotic signal transduction research. BMC Genomics. 2009;10:315. pmid:19604365

[ref43] 43. Barakat M, Ortet P, Whitworth DE. P2CS: a database of prokaryotic two-component systems. Nucleic Acids Res. 2011;39(Database issue):D771–776. pmid:21051349

[ref44] 44. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47(D1):D607–D613. pmid:30476243

[ref45] 45. Hirakawa H, Takumi-Kobayashi A, Theisen U, Hirata T, Nishino K, Yamaguchi A. AcrS/EnvR represses expression of the acrAB multidrug efflux genes in Escherichia coli. J Bacteriol. 2008;190(18):6276–6279. pmid:18567659

[ref46] 46. Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science. 1983;220(4598):671–680. pmid:17813860

[ref47] 47. Hartmann AK, Weigt M. Phase transitions in combinatorial optimization problems: basics, algorithms and statistical mechanics. John Wiley and Sons; 2006.

[ref48] 48. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. pmid:7984417

[ref49] 49. Felsenstein J. PHYLIP (Phylogeny Inference Package). Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.

[ref50] 50. Pazos F, Ranea JA, Juan D, Sternberg MJ. Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. Journal of Molecular Biology. 2005;352(4):1002–1015. pmid:16139301

Supporting information

S1 Text. All supporting material is collected in a single supplementary text file.
