A multi-scale coevolutionary approach to predict interactions between protein domains

Interacting proteins and protein domains coevolve on multiple scales, from their correlated presence across species, to correlations in amino-acid usage. Genomic databases provide rapidly growing data for variability in genomic protein content and in protein sequences, calling for computational predictions of unknown interactions. We first introduce the concept of direct phyletic couplings, based on global statistical models of phylogenetic profiles. They strongly increase the accuracy of predicting pairs of related protein domains beyond simpler correlation-based approaches like phylogenetic profiling (80% vs. 30–50% positives out of the 1000 highest-scoring pairs). Combined with the direct coupling analysis of inter-protein residue-residue coevolution, we provide multi-scale evidence for direct but unknown interaction between protein families. An in-depth discussion shows these to be biologically sensible and directly experimentally testable. Negative phyletic couplings highlight alternative solutions for the same functionality, including documented cases of convergent evolution. Thereby our work proves the strong potential of global statistical modeling approaches to genome-wide coevolutionary analysis, far beyond the established use for individual protein complexes and domain-domain interactions.

A Input data The starting point of our analysis is the phylogenetic profile matrix (PPM): a binary matrix (n a i ) a=1,...,M i=1,...,N whose entries capture the presence (n a i = 1) or absence (n a i = 0) of a domain i in genome a, with a = 1, . . . , M (M being the number of genomes) and i = 1, . . . , N (N being the number of domains). As discussed in the main text, the domains (the columns of the PPM) are then compared with each other to look for functionally related domains. The data we use are extracted from the Pfam 30.0 database (version of July 2016) [2], and assigned to bacterial or eukaryotic species using the Uniprot species list available on (http://www.uniprot.org/docs/speclist).

B Similarity measures
In standard phylogenetic profiling [5] the correlations between the columns (n i , n j ) describing a pair of domains are usually evaluated via the Hamming distance, Pearson correlation or the p-value of the Fisher's exact test. We briefly describe each below.
• Hamming distance: counts the number of bits which differ between two binary strings n i , n j divided by the total number of domains, i.e. the number of species containing exactly one of the two domains, (2) • p-value of Fisher Test: for each couple (n i , n j ) we construct an auxiliary 2 × 2 matrix M 1,1 M 1,2 M 2,1 M 2,2 with: We define R i = j M i,j , C j = i M i,j , and we have M = i,j M i,j . We calculate the conditional probability of getting the actual matrix given the particular row and column sums: which is a multivariate generalization of the hyper-geometric distribution. Theoretically, we analyse all the matrices of non negative integers consistent with the marginals R i , C j and M , and for each of them we calculate the p-value Eq.(4). The p-value of the test is the sum of all p-values which are P ≤ P cutof f . Small p-values thus indicate atypical cases related to correlations between the distributions of the two domains across species.

C Inference techniques
As a simple but meaningful statistical model we consider a pairwise lattice-gas Ising model with the following Hamiltonian: It shall model the statistical variability of the phylogenetic profile n = (n 1 , ..., n N ) of an entire species. The binary variables n i (i = 1, . . . , N , with N being the total number of domains) can have values n i = 1 i.e. the domain is present, or n i = 0 otherwise. Each row of the PPM has this form, and H can thus be evaluated for each species contained in the PPM.
To infer the Ising model Eq. (5) reproducing the empirical one-and twocolumn statistics, we assume as working hypothesis that the phylogenetic profile matrix PPM is composed by configuration sampled from the equilibrium Boltzmann-Gibbs distribution (in the inference process we are free to set the inverse temperature β = 1): In the inference procedure we look for parameters {h, J} (respectively the fields and the phyletic couplings) that maximize the log-likelihood: where w m are reweighting coefficients defined below in Section C.1. Nevertheless the exact maximization procedure requires the determination of the marginals of P (n) for single variables and pairs, which is highly computationally expensive.
In the last years [9] a variety of efficient and accurate approximation methods have been proposed: in this work we use the Mean Field (MF) and pseudolikelihood maximization (PLM) approximations, which, as showed in Figure 2 of the main text, give similar results on our datasets. We briefly describe each one below. We also mention that we developed an implementation of Phyletic-Coupling Analysis PhyDCA, in Julia, a new open-source language [15]. The package can be downloaded at https://github.com/GiancarloCroce/PhyCA.

C.1 Pre-processing the data
In order to reduce sampling biases, a simple correction is to associate to each phylogenetic profile of a species n a the weight w a = 1/m a with m a the number of similar profiles (including a itself): with d H (n a , n b ) being the Hamming distance and θ a similarity threshold. It has been empirically observed that setting θ = 0.8 slightly improves the PPV. The sum of all the weights M ef f = M a=1 w a , represents the effective number of phylogenetic profiles counted as independent in the analysis.

C.2 Mean Field (MF)
The Mean Field approximation is currently the fastest approximate inference scheme even if not accurate when the number of samples is small nor when the interactions are strong [9]. The phyletic couplings J ij are inferred from the empirical correlation matrix: where λ is a pseudocount that we set to λ = 1 4 which we empirically found to produce the optimal PPV.

C.3 Pseudolikehood (PLM)
Adapted in [4] to protein sequences, the approximation consists in replacing the Boltzmann-Gibbs measure with the following conditional probability distribution: with n \i = (n 1 , ..., n i−1 , n i+1 , ..., n N ) being the phylogenetic profile of all domains different from i. For a given data PPM, the likelihood with respect to the distribution Eq.(12) -usually called pseudo-likelihood - can be easily maximized as a function of (J i,\i , h i ). As is customary in inference problems, we add an 2 regularization term γ h i h 2 i + γ J i =j J 2 ij to mitigate the effects of overfitting by penalizing parameters with large value. The free parameters γ J and γ h are set to γ J = 0.02 and γ h = 0.05. We refer to [4] for the details of the implementation, which was adapted to binary variables.

C.4 Average product correction (APC)
Another correction, introduced in [13], consists in subtracting from the inferred couplings J ij a contribution due to the single-site properties of i and j (the average over sites is denoted by •): Since the APC correction aims to minimize background influences like phylogeny and site entropy, we included it in PhyDCA. While for residue-level DCA, APC-corrected scores have a significantly better prediction accuracy [9], an a posteriori analysis show that the effect is small and almost negligible in PhyDCA (see Figure A).  Figure B shows the histograms of the similarity measures (Hamming distance, Pearson correlation, the P-value of the Fisher's exact test and phyletic couplings) for all the pairs of domains considered in the main text. Related domains ought to have profiles of high phyletic couplings, high correlations, low p-value of Fisher's exact test or low Hamming distances, since the first two are similarity, the second two more dissimilarity measures. In Figure C we plot the phyletic-couplings J ij found using the mean-field (MF) and the pseudo-likelihood maximization (PLM) approximations. Predictions are done by sorting all couplings in decreasing order. They are evidently highly similar, but include a partial reordering. To extract Figure 2 of the main text, the blue dots are interpreted as false positives. From Figure C it is evident that these false positives are -even for very large coupling values -similarly distributed for the two approximations in between the true positives (red points), therefore showing that none of the methods has a clear advantage in precision.  The kink in the histograms between the bulk of small J ij and the tail of large J ij is observed to provide a good cutoff value for high-quality predictions. Once again, it is located close to J plm = 0.5 or (J M F = 1.5).

E Progressive paralog matching procedure
As discussed in the main text, a large positive phyletic coupling is a strong indicator of a functional relationship between two domains, but not necessarily of a direct physical interaction. To identify physical interactions we use the procedure introduced in [17], which studies coevolution of domain pairs at the level of the individual residues. We briefly describe the method here; for a full description and additional details we refer to the original publication [17].
Given a pair of strongly coupled domains and the corresponding multiple sequence alignments (say M SA 1 and M SA 2 ) the problem is to generate a concatenated alignment: we need to find for any sequences belonging to M SA 1 a matching partner in M SA 2 . Only sequences belonging to the same species should be matched. This problem has a trivial solution (which we call matching by uniqueness) when there are no paralogs and each species has only one sequence in each MSA. However, many protein domains exist in multiple paralogs across many species. In this case, the method proposed in [17] is based on the idea that the correct matching of interacting paralogs should maximize the inter-domain coevolutionary signal. The matched MSA is than used to identify interacting protein families: an average of the four highest inter-protein residue-residue plmDCA scores larger than 0.2 is a strong indicator for a potential interaction, at least of the joint MSA has an effective size M ef f > 200.
The following procedure was used to create the matched-alignment and compute the DCA scores: 1. download the two PFAM alignments (M SA 1 and M SA 2 );

if considering a phylogenetic profile matrix with bacterial [eukaryotic]
genomes, select only bacterial [eukaryotic] sequences; 3. run the progressive paralog-matching (PPM) to find a locally optimal matching; 4. run plmDCA and compute the DCA score, given by the average of the first four highest interdomain residue-residue plmDCA scores.
In Tables A and B we report the first 100 strongly phyletically coupled domain pairs inside the E. coli K-12 MG1655 strain not belonging to our list of positive known domain-domain relations (i.e. our predictions for such relations). Figure E and Figure F show the results of the matching procedure for the domain pairs inside the E. coli K-12 MG1655 strain.

domains in iPfam
The panel on the left shows the result of the matching procedure for the 200 pairs of highest phylogenetic couplings belonging to the iPfam database.
The complete list can be found on the Github page at results/ECOLI_matching_iPfam_results.dat . On the right, as a comparison, a random matching for the same domain pairs. Figure G show the results of residue-level DCA for the 200 domain pairs of strongest phyletic couplings, which are co-localized in one protein in E. coli. Due to the co-localization, the generation of a joint MSA is trivial in this case; the paralogs-matching can be avoided. Note that domains can co-occur in the same protein without direct physical interactions. Out of the 200 pairs, 144 of these domain pairs are also listed in iPfam, meaning that a direct physical interaction is structurally known.

E.1 Network analysis
As stated in the main text, CoPAP and PhyDCA treat very different confounding factors of coevolutionary analysis -phylogenetic biases and indirect correlations. Nevertheless from Figure 3 of the main text, it appears that almost none of the correlated pairs strongly coupled in PhyDCA, are actually discarded by CoPAP. But are the correlations of pairs, which are retained by CoPAP as non phyletically coupled, but discarded by PhyDCA, really an indirect network effect of the PhyDCA couplings?
To answer this question, we first introduce in Fig. H two scatter plots of the phyletic couplings vs. Pearson correlations between domain pairs, in the first case for the 3611 domain pairs of highest CoPAP score, in the second case for all domain pairs. In both cases, we see a clear triangular shape, indicating that large couplings lead to large correlations, but large correlations can exist between weakly coupled pairs. Since our PhyDCA model reproduces correlations using couplings, the latter case must result from indirect correlations. Also as a consequence, the phyletic coupling network is substantially sparser than the correlation network. To corroborate this, in Fig. I, we consider the network of the 1000 strongest phyletic couplings and study the correlations as a function of the shortest-path distance between domains along this network. Correlations decrease with distance until they saturate at a low but non-zero level. This is coherent with the idea that empirical correlations found in the data have at least three contributions -direct correlations induced by direct couplings (at distance 1), indirect couplings induced by coupling chains, and a ground level of correlations, which possibly result from phylogenetic correlations between the species and other sampling effects. If we take alternatively the network induced by the 1613 pairs, which have large Pearson correlations and are preserved by CoPAP (the intersection of the red and green circles in Fig. 3A in the main text), we also find a correlation decrease (as to be expected in any sparsely connected graphical model), cf. Fig. J. However, the decay is slower than on the PhyDCA network, even if the network is denser. Pairs in the PhyDCA network are thus less correlated than pairs at the same distance in the correlation network, which shows that the phyletic coupling network more parsimoniously explains the connectivity patterns present in the data.

F All bacteria
In the main text we use the model organism Escherichia coli as reference genome in order to have a large set of known domain-domain relationships. In this section we consider a broader selection of genomes by applying the same methodology to all 9,358 Pfam domains appearing in bacteria. To access the accuracy of our prediction we compile a number of known domain-domain relationships: intra-protein localization (out of 2,972,104 proteins 866,591 contain multiple domains, giving rise to 26,381 distinct domain-domain relations), domaindomain contacts in 3d structures (from the iPfam database, for a total of 545 known relationships), protein-protein interaction (from the IntAct database, obtaining 67,409 domain pairs). This leads to a total of 92,428 known relationship (cf. Figure K, Panel A).
We then select the couplings between domains which are only present in E. coli genome ( cf. Figure K, Panel B and C) finding 96% correlation with the couplings inferred in the main text, thus proving the robustness of the results with respect to the selection of domains. Figure K: Phylogenetic couplings Panel A shows the PPV of the phyletic couplings of all bacterial domains for predicting domain-domain relationships (including protein architecture, iPfam and IntAct entries). Panel B shows a histogram of couplings J ij , as inferred by PLM, for the domains present in all bacteria and for those appearing only in E. coli. In Panel C we retain from the bacterial phyletic couplings only the couplings between domains present in E. coli. Then we compare them with the couplings found by the procedure described in the main text, finding a correlation of 96% between the two.
We have applied the paralog-matching analysis to the 200 most coupled bacterial domain pairs (see Figure L ). A list of the domain pairs, their phylogenetic coupling and the DCA score can be found on the Github page at results/ALLBACTERIA_matching_results.dat).

F.1 Negative phyletic couplings
As discussed in the main text, a negative phyletic coupling disfavors the joint presence of two domains in the same genome and thereby highlights alternative solutions for the same functionality. In Table D we report the 100 most strongly negatively coupled domain pairs with their PFAM description.

G Eukaryotic genomes: human as reference species
The complete list of results of our analyisis applied on eukaryotic genomes using human as reference species can be found on the Github page at results/HUMAN_matching_results.