PPIDomainMiner: Inferring domain-domain interactions from multiple sources of protein-protein interactions

Many biological processes are mediated by protein-protein interactions (PPIs). Because protein domains are the building blocks of proteins, PPIs likely rely on domain-domain interactions (DDIs). Several attempts exist to infer DDIs from PPI networks but the produced datasets are heterogeneous and sometimes not accessible, while the PPI interactome data keeps growing. We describe a new computational approach called “PPIDM” (Protein-Protein Interactions Domain Miner) for inferring DDIs using multiple sources of PPIs. The approach is an extension of our previously described “CODAC” (Computational Discovery of Direct Associations using Common neighbors) method for inferring new edges in a tripartite graph. The PPIDM method has been applied to seven widely used PPI resources, using as “Gold-Standard” a set of DDIs extracted from 3D structural databases. Overall, PPIDM has produced a dataset of 84,552 non-redundant DDIs. Statistical significance (p-value) is calculated for each source of PPI and used to classify the PPIDM DDIs in Gold (9,175 DDIs), Silver (24,934 DDIs) and Bronze (50,443 DDIs) categories. Dataset comparison reveals that PPIDM has inferred from the 2017 releases of PPI sources about 46% of the DDIs present in the 2020 release of the 3did database, not counting the DDIs present in the Gold-Standard. The PPIDM dataset contains 10,229 DDIs that are consistent with more than 13,300 PPIs extracted from the IMEx database, and nearly 23,300 DDIs (27.5%) that are consistent with more than 214,000 human PPIs extracted from the STRING database. Examples of newly inferred DDIs covering more than 10 PPIs in the IMEx database are provided. Further exploitation of the PPIDM DDI reservoir includes the inventory of possible partners of a protein of interest and characterization of protein interactions at the domain level in combination with other methods. The result is publicly available at http://ppidm.loria.fr/.


Introduction
Many biological processes from metabolic pathways to cellular signaling are mediated by protein-protein interactions (PPIs). However, the experimental determination and analysis of such interactions are often difficult and time-consuming. The developments in high-throughput gene sequencing techniques have created a huge gap between the ever increasing number of known protein sequences and the knowledge of their function and of their biological interactions. There is therefore much interest in developing computational approaches to predict PPIs. Because protein domains are the building blocks of proteins, PPIs mainly rely on given combinations of domain-domain interactions (DDIs). Predicting and modelling PPIs should therefore benefit from a systematic inventory of all possible DDIs.
In fact, the interest raised by DDI study is justified by simple combinatorial reasoning. As most proteins are composed of a limited subset of domains, the study of all possible interactions between the hundreds of millions of proteins available in the sequence databases can be advantageously reduced by the study of the possible interactions between the tens of thousands of domains existing in current domain classifications such as Pfam (17,929 families in the 2018 release 32; [1]) or CDD (Conserved Domain Database, 52,910 conserved domain models in the v3.17 version; [2]. It is therefore of utmost importance to enumerate all plausible DDIs which could explain all the PPIs described so far and which could be used to predict interactions which have not yet been described. Today, well established DDI resources are based on experimental evidence such as 3D structures extracted from the Protein Data Bank (PDB). This is the case of 3did (updated every year; [3][4][5]), KBDOCK (last release in 2016; [6]), iPfam (last release in 2013; [7]) and INSTRUCT (last release in 2013; [8]). However the DDI content of these databases does not account for all PPIs described so far. For example, it has been estimated in 2015 that DDIs extracted from 3did only cover around 20% of PPIs from the STRING database [9].
For this reason, many computational methods have been proposed for inferring DDIs from the domain composition of proteins involved in PPIs. The pioneer work by Sprinzak and Margalit in 2001 used a simple correlation measure of domain occurrence in PPIs [10]. It was rapidly extended and improved in quite diverse manners.
A first group of methods uses probabilistic or statistical approaches including measures of the interaction probability between two domains [11,12] or involving expectation maximization algorithm to maximize a certain likelihood function over a given interactome [13][14][15], enrichment calculation [16], sometimes with the help of functional (Gene Ontology) annotations [17,18] or with complex network neighborhood coefficients [9].
Other methods exploit the properties of DDIs related to sequence and gene evolution, such as the concepts of domain fusion [19], phylogenetic profiling [20], relative co-evolution of domain pairs [21,22], correlated mutations [23], or combination of these properties [24].
Another group of methods aims to explain a given set of PPIs with a minimum set of DDIs using various optimization methods such as Integer Linear Programming for parsimonious explanation [25][26][27], genetic algorithm [28], or parameter-dependent selection [29].
Finally, other authors use PPI and non PPI datasets to develop machine learning approaches such as random forest framework [30,31], discriminant feature selection methods [32], or formal concept analysis [33].
The various datasets of inferred DDIs generated by all these methods do not overlap perfectly well, mainly because they rely on various non overlapping protein interactomes such as DIP (Database of Interacting Proteins; [34]), IntAct [35], HPRD (Human Protein Reference Database; [36]) or STRING [37]. Thus, several attempts have tried to merge a certain number of the produced DDI datasets and compute a score for each DDI depending on its presence in the composing resources.
The most famous integrated resource is certainly the DOMINE database, created in 2008 and updated in 2011 [38,39]. In its last available release (2011), DOMINE has merged not less than 12 sources of inferred DDIs with two sources of DDIs derived from 3D structures, namely iPfam (2007 release) and 3did (2010 release), leading to a dataset of 26,219 DDIs involving 5,140 distinct Pfam domains.
The UniDomInt [40] and IDDI (Integrated DDI; [41]) resources are other initiatives aimed at merging well-established sets of structural DDIs with sets of inferred DDIs from various methods (9 and 20 sets, respectively). The DIMA3.0 database (Domain Interaction MAp; [23]) integrates more than 5,800 structurally known DDIs from iPfam and 3did databases and 46,900 DDIs computed with four of the methods listed above.
The fact that different methods infer the same DDIs can be considered as an argument for DDI reliability. Indeed, in the DOMINE resource, an index quantifying the percentage of overlap (POI) between the twelve sources has been defined and each DDI has received a score calculated as the sum of the POIs of all the sources inferring this DDI. Depending on their score and on their participation in the same biological process (according to the Lee's dataset; [17]), DDIs are classified as high, medium or low confidence. Resources such as UniDomInt and IDDI use slightly different methods to assess consistency between the DDIs datasets [40,41].
The extreme heterogeneity of methods for inferring DDIs and the absence of regular updates for most of the generated datasets have lead to the absence of any available large-scale updated repository of scored DDIs, reflecting the multiple PPI resources available today. Such a repository could be advantageous in studying PPIs at the domain level.
Therefore, we were encouraged to develop a generic new method capable of exploiting multiple PPI datasets for the automatic inference of DDIs. The method is called PPIDM (for Protein-Protein Interaction Domain Miner) and is inspired both by the statistical approaches described above and by the CODAC method, that we previously designed to infer protein domain annotations [42]. CODAC uses a tri-partite graph setting and a Gold-Standard of existing associations to infer all possible associations between two sets of objects, assuming the objects share a similar neighborhood in a third set of objects. In PPIDM, the two sets of objects are sets of domains and the third set is a set of PPIs.
In this paper, we first describe the PPIDM algorithm and its application to seven datasets of PPIs using a Gold-Standard of 7,254 DDIs derived from 3did and KBDOCK. We then compare the inferred DDIs (84,552 pairs of Pfam domains) with inferred DDIs in DOMINE, revealing a large percentage of newly inferred DDIs in PPIDM. We also analyse the overlap with the 2017 to 2020 releases of 3did, revealing that the DDIs inferred by PPIDM overlap with 3, 239 (46%) DDIs present in the 3did 2020 release (not counting the Gold-Standard), including 800 newly observed DDIs that were not known in 3did when PPIDM was generated. We also study the coverage of PPIs from curated (IMEx) or non curated (STRING) databases by PPIDM DDIs and we compare with DOMINE to the advantage of PPIDM. Finally, we describe the biological plausability of a subset of newly inferred DDIs from PPIDM.

CODAC-inspired PPIDM approach
The CODAC-inspired PPIDM approach is based on a tripartite graph setting [42]. In graph theory, a k-partite graph is a graph whose vertices can be partitioned into k disjoint subsets, such that in each subset no two vertices are connected. If k = 2, the graph is called a bipartite graph (or bigraph), and if k = 3 it is called a tripartite graph. The tripartite graph setting is designed here to solve problems of bipartite graph enrichment also known as edge prediction or inference. The main intuition is to calculate new weighted edges between two sets of items when sparse known edges already exist between the two and when both sets display dense connections with a common third set of items. Let GðX; Y; Z; EÞ be a tripartite graph where X, Y and Z are 3 sets of items and E is the set of all edges connecting X, Y and Z in the input configuration. Let us consider the 3 bipartite subgraphs of G, denoted as G 1 ðX; Z; E 1 Þ, G 2 ðY; Z; E 2 Þ, and G 3 ðX; Y; E 3 Þ. We now assume that the set of edges E 3 is incomplete, and that the aim is to compute new edges between items of X and items of Y in order to generate G � 3 ðX; Y; E � 3 Þ which together with G 1 and G 2 will make the final tripartite graph, G � ðX; Y; Z; E � Þ, where E � denotes an enriched set of edges. New edges may be inferred by exploiting the existing edge distributions in G 1 and G 2 . The hypothesis here is that if items x i of X and y j of Y share the same (or almost the same) set of neighbors {z k } in Z, then it may be inferred, with a certain confidence to be estimated, that an edge exists between x i and y j .
In the PPIDM approach, we want to infer new DDIs between domains that are present in the proteins participating in PPIs. Thus, we instantiate the tripartite graph settings with Z, a set of PPIs, and X and Y, two sets of Pfam domains that compose the proteins from Z, as shown in Fig 1. More precisely, we distinguish the left and right components of the pair of proteins forming a PPI and ordered according to the alphanumerical order of their identifiers (Ids). The first bipartite graph G 1 ðX; Z; E 1 Þ represents the relations between domains of X and the Here, the (d3, d2) edge is inferred because d3 and d2 are found in ppi1 and ppi2, and (d3, d4) is inferred because d3 and d4 are found in ppi2 and ppi3. However, the score of (d3, d2) will be lower than the score of (d1, d2) because d3 has one neighbor that does not contain d2 (namely ppi3).
https://doi.org/10.1371/journal.pcbi.1008844.g001 left proteins of PPIs in Z, while the second bipartite graph G 2 ðY; Z; E 2 Þ represents the relations between domains of Y and the right proteins of PPIs in Z. In Fig 1, the X and Y sets are named D L and D R , respectively. By construction, D L and D R are overlapping but distinct sets of domains encompassing the domains present in the left and right proteins of PPIs, respectively. The third bipartite graph G 3 ðX; Y; E 3 Þ is initialized with a set of edges representing DDIs observed in structural databases. Our Gold-Standard set of edges E 3 pertains from the intersection of the KBDOCK and 3did databases.
The output of the PPIDM algorithm is E � 3 , which will contain an enriched set of DDIs weighted by their neighborhood similarity score. The adjacency matrices M X of X in Z and M Y of Y in Z are built and cosine similarities are computed between each row of M X and each row of M Y , representing the neighborhood similarity score of the corresponding two domains in the PPI dataset Z.
At this stage of the work a p-value can be calculated to estimate the significance of inferring a certain edge (x, y) given Z. Indeed, it is reasonable to suppose that an edge (x, y) could get a high score at random if the x and y items are each highly connected to many items in Z. Here, we assume that the probability of finding an edge (x, y) by random chance is given by an hypergeometric distribution of the number of common neighbors (x, z) and (y, z). Letting N z denote the total number of items in Z, N x the number of neighbors of x in Z, and N y the number of neighbors of y in Z, the probability that the (x, y) pair gets a number of common neighbors K greater than or equal to the observed K x,y is given by the hypergeometric probability Eq (1): Because the p-value test is calculated for a large number of (x, y) edges in G � 3 , we apply a Bonferroni correction to take into account the so-called family-wise error rate [43]. Therefore, letting jE � 3 j denote the total number of edges tested, we consider any p-value less than 0:05=jE � 3 j as denoting a statistically significant edge.
In order to determine an edge similarity threshold, we need to define a learning set of positive and negative examples of DDIs. Here, we take all the Gold-Standard DDIs as positive examples (Pos). In practice, we extract a total of 8,581 and 8,670 DDIs from 3did and KBDOCK, respectively. We then obtain 7,254 common DDIs in which each domain is represented by a Pfam identifier. These DDIs are scored among others in the general cosine similarity matrix associated with the tripartite graph. To create negative examples, we shuffle the edges corresponding to the Pos DDIs in G 3 in order to rearrange in a random way all the Gold-Standard DDIs, while keeping the node degrees of each x i and each y j conserved, and not allowing to return to the original edges. We then select at random 7,254 shuffled negative DDIs and look for their score in the general cosine similarity matrix. We could get a score for only 5,145 of them (named thereafter the Neg DDIs) but imbalanced data is not a problem here since we used F 1 − score for selecting the threshold.
Taken together, the Pos and Neg DDIs constitute our learning set for threshold determination. We randomly split the learning set into two groups with equal proportions of positive and negative examples to give the "Training" and "Test" subsets representing respectively twoand one-third of the learning set. We then apply 5-fold cross-validation to the Training subset to determine the optimal score threshold leading to maximal F-measure. For each fold, we rank the scores of 4/5 folds and label them "positive" or "negative" according to a score threshold that is varied in three phases with increasing resolution: firstly from 0.00 to 1.00 in steps of 0.01, then from 0.00 to 0.04 in steps of 0.001, and finally from 0.01 to 0.02 in steps of 0.0001. This allows us to determine the numbers of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions for each threshold value. We then calculate the recall, R = TP/(TP + FN), precision, P = TP/(TP + FP), and F 1 − score, F 1 = 2RP/(P + R). The similarity threshold T that gives the best F 1 − score in the 4/5 folds is verified using the fifth fold, the procedure is repeated for the five folds taken as test fold. The mean threshold T m is then used to calculate the resulting F 1 − score on the entire Training set. The robustness of this threshold is ultimately verified by calculating the F 1 − score on the Test set (not used during threshold optimization). If satisfying, our approach is validated and the mean threshold is retained to calculate a filtered cosine similarity matrix,

Combining multiple datasets
One original feature of the PPIDM approach is to be designed for handling multiple data sources of PPIs (Z in our tripartite graph model). When more than one PPI datasource is used, a tripartite graph is built for each data source d and processed separately to calculate its respective cosine similarity matrix C d . All matrices are then combined in a unique consensus matrix merging all domain subsets considered in all the graphs. Whenever there is no data for a given pair (x, y) in an input graph G d , the corresponding score C d x;y is set to zero. The cosine similarity scores are then combined as a non-zero weighted average (over all non-zero scores) to give a consensus similarity matrix, CS.
Weight determination is performed using Receiver-Operator-Characteristic (ROC) analysis as for information retrieval system when one wishes to retrieve positive documents as first ranked, i.e. with the best scores [44]. Here the positive examples are taken from the Gold-Standard set of positive DDIs and the background will be formed by all the scored DDIs in the consensus similarity matrix. One advantage of ROC-based approaches is that they are rather insensitive to the particular number of positive and negative instances used [45]. Thus, in order to find the best values for the weights w d relative to each data source used, each weight is varied from 0.01 to 1 in steps of 0.01. For each combination of weights, a ROC performance curve is calculated using the complete ranked list of consensus scores and our Gold-Standard set of positive examples. The combination of weights that gives the largest area under the curve (AUC) is selected and used to calculate the best consensus similarity matrix CS.
Algorithm 1 summarizes all this procedure that allows PPIDM to handle multiple data sources of PPIs to infer DDIs.
We finally classify our DDIs into Gold, Silver, and Bronze categories using the p-values calculated for each DDI in each input databases. Gold DDIs are those with a non-zero score in at least half of the data sources and that display a significant p-value in all these data sources, Silver DDIs are those with a non-zero score in less than half of the data sources while displaying a significant p-value in all these data sources, and Bronze are the remaining DDIs.

Application to PPI sources
In this study, seven PPI sources have been used: IntAct and MINT [46], DIP [34], HPRD [36], and BioGRID [47,48] are manually curated databases of physical interaction between proteins; the very extensive STRING database [37] contains both physical and predicted interactions between proteins; the SIFTS database [49,50] allows to access to PPIs present as complexes between PDB chains in the PDB. Thus, these seven PPI databases together provide a comprehensive combination of all available protein interactions and are appropriate for our PPIDM approach. Note that we retrieve all available interactions from these databases and we do not discriminate between stable and transient interactions.
Flat data files of IntAct, DIP, MINT, HPRD, BioGRID, STRING, SIFTS, KBDOCK, 3did and UniProt (all releases as of February 2017), were downloaded and parsed using in-house Python scripts. The four sources IntAct, DIP, MINT, HPRD and SIFTS contained PPIs expressed as ordered pairs of UniProt sequence accession numbers (UniProtIds). In BioGRID and STRING, proteins are designated with an identifier system specific to each database. These identifiers were mapped to UniProtIds to produce the BioGRID and STRING datasets of PPIs. In the SIFTS database, associations between PDB chains were extracted and PPIs displaying a high possibility of interaction according to [51] were stored. Then, PDB chains possessing a representative UniProtId were replaced by this UniProtId. The numbers of PPIs obtained from the input resources are shown in Table 1. A very large number of protein interactions is drawn from the STRING database, while the SIFTS database only provides a small collection of observed PPIs. We then categorized the STRING PPIs according to experimental and non-experimental (Text mining, Neighborhood, Fusion-fission events, Occurrence, and Coexpression) labels and stored them as STRING-exp and STRINGrest datasets, respectively.

Algorithm 1 Calculating a Consensus Similarity Matrix from Multiple Sources of Common Neighbors
For each data source, the PPIs ppi k were represented by ordered pairs of UniProtIds (L k , R k ), in which L k always precedes R k in the alphanumerical order to avoid redundancy.
Associations between UniProtIds and Pfam domains were then extracted from UniProtKB/ SwissProt and UniProtKB/TrEMBL to give a dataset of UniProtIds-Pfam relationships for

Methods for PPIDM result evaluation
The set of inferred DDIs was compared with various other sets of DDIs using Python dataframes and a setwise representation for DDIs to avoid duplicate comparison of symmetrical pairs. The June2017, January2018, January2019 and April2020 releases of 3did were used after subtracting the Gold-Standard data set (7,254 DDIs), in order to check how many inferred DDIs correspond to DDIs in 3did that were not used as Gold-Standard during PPIDM generation, and how many inferred DDIs correspond to DDIs that were not known from 3did at the time PPDIM was generated. The DOMINE 2011 [39] dataset was also downloaded from its web sites (https://manticore.niehs.nih.gov/cgi-bin/Domine and prepared for similar comparison purpose. Venn diagrams were generated using the matplotlib Python library. The coverage of PPIs by DDIs was measured in a large set of non redundant PPIs, named here IMEx-PPIs, extracted from the curated database assembled by the International Molecular Exchange Consortium (IMEx; [52,53]), downloaded from https://www.imexconsortium. org/ on 23 August, 2020. Another dataset was also built using the human STRING subset of PPIs [54], downloaded on 30 July, 2019, and filtered to retain only PPIs with a STRING score greater than 800. This dataset is named here hSTRING-PPIs. Only PPIs composed of proteins containing at least one Pfam domain were retained, leading to a total of 50,032 and 607,088 PPIs for IMEx-PPIs and hSTRING-PPIs respectively. A PPI is counted as covered by a DDI inferred from PPIDM when at least one DDI from this dataset is present among the pairs of domains derived from the domain composition of each protein in the PPI. A corollary result is the estimation of the percentage of useful DDIs in the dataset. A useful DDI is a DDI used to cover at least one PPI. Counting was performed with PPIDM and DOMINE datasets. A negative control was added here thanks to a dataset of negatively labelled DDIs known as Nega-tome2.0 [55] that was downloaded from http://mips.helmholtz-muenchen.de/proj/ppi/ negatome/), and prepared similarly to the other datasets.  Table 2). In our ROC-based training procedure, the best AUC value of 0.9944 was obtained with weights w IntAct = 0.05, w DIP = 0.01, w MINT = 0.01, w BioGRID = 0.09, w STRING − Exp = 0.12, w STRING − Rest = 0.06, w HPRD = 0.17, and w SIFTS = 1.0. These weights give far greater importance to the candidate interactions in the SIFTS dataset, which is derived from the PDB, compared to those from other databases. This is consistent with the fact that our Gold-Standard positive instances are observed interactions extracted from PDB entries. The optimal score threshold for the consensus similarity matrix was determined according to the F1 − score calculated on the learning set of positive and negative DDIs described above. A score threshold of 0.01586 was obtained for a maximal F-Measure of 0.9718 on the Training set. Applying this threshold to the Test set yielded a comparable F-measure of 0.9717, and precision and recall values of 0.9808 and 0.9628, respectively.

Analysis of the inferred DDIs
The overall results of PPIDM execution are summarized in Table 2. This table shows the numbers of DDIs before and after filtering with the consensus score threshold. After applying the 0.01586 score threshold, the number of pairwise DDIs falls to 84,552 i.e. 1.84% of the merged dataset with an overlap of about 96.3% (6,989�7,254) with the Gold-Standard (3did \ KBDOCK) reference DDIs. Table 3 shows the distribution of PPIDM inferred DDIs in our Gold, Silver, and Bronze categories, along with the degree of overlap with the Gold-Standard reference dataset. This table shows that PPIDM provides 9,175 Gold DDIs (with only statistically significant scores in more than 4 PPI sources) and 24,934 Silver DDIs (with only statistically significant scores in less than 4 PPI sources), and 50,443 Bronze DDIs (having at least one non statistically significant score). However, it also shows that the 6,989 Gold-Standard DDIs retrieved in PPIDM are broken down in 2,852 Gold, 3,888 Silver and only 249 Bronze DDIs. This is a good indication that the PPIDM scoring and filtering strategy likely infers high quality and relevant new DDIs in the Gold and Silver categories.

Comparison with the DOMINE dataset
In order to compare PPIDM with the DOMINE dataset, we extracted DDIs from the file available from the latest version of the DOMINE database (2011). The DOMINE file contains 26,219 inferred DDIs (Pfam-Pfam interactions) with 5,410 distinct Pfam domains. This set (shown as green in Fig 2) was compared with the set of all 84,552 DDIs inferred by PPIDM (blue in Fig 2). This comparison showed that 8,433 DDIs from DOMINE (about one third of this dataset) are present in the PPIDM dataset, including 4,824 DDIs from the Gold-Standard (Intersection between yellow, blue, and green in Fig 2). The remaining 17,676 DDIs in DOMINE were then compared with the DDIs from the Gold-Standard. This comparison (the intersection of purple and yellow minus blue) showed that only 110 DDIs are common to DOMINE and the Gold-Standard but not to PPIDM, indicating that PPIDM misses only 2% (110 � 4,934) of the DOMINE DDIs confirmed by the Gold-Standard. This comparison also shows that the PPIDM result set only infers 3, 609 DOMINE DDIs from its 21,285 (26,219 − 4,934) DDIs not shared with the Gold-Standard. This percentage is rather low (around 17%) and likely reflects the heterogeneity of methods used for DDI inference in DOMINE and the lack of consensus between the datasets merged in DOMINE.

Evaluation of PPIDM predictions
It is very difficult to review individually the large amount of DDIs inferred by PPIDM. A first attempt of global evaluation has consisted in analyzing the capacity of PPIDM to infer experimental DDIs, for example observed in PDB complexes, outer from the Gold-Standard. For this purpose, we downloaded the DDI sets from the 2017(june), 2018, 2019 and 2020 releases of the 3did database, we subtracted the Gold-Standard set of DDIs (7,254 DDIs) leading to the so-called 3did δ dataset. We then checked the overlap between the remaining DDIs and the set of DDIs inferred by PPIDM (Fig 3, panel A) or DOMINE (panel B). It can be seen that the DDIs inferred by PPIDM cover an increasing number of DDIs from the various 3did releases. The percentage of coverage decreases with time (from 70% for the 2017 release to about 46% for the 2020 release). A similar behaviour is observed with the DOMINE dataset but to a much lesser extend. The percentage of coverage varies from 28% for the 2017 release, to about 17% for the 2020 release of 3did, at the advantage of the PPIDM approach. More precisely, from the 3,854 new DDIs in 3did 2020 that were unknown in 2017 (when subtracting 3did2017(february) from 3did2020), PPIDM has inferred exactly 800 DDIs and DOMINE only 234.
The disadvantage of DOMINE here is likely due to the fact that the datasets it contains have been produced several years ago (from 2004 to 2010). However for PPIDM, it shows that a non negligeable subset of inferred DDIs are validated by observations in 3did that did not exist at the time PPIDM was produced.
The decrease in the percentage of coverage of 3did by PPIDM inferred DDIs can be explained by the rapid growth of 3did that directly follows the growth of the PDB. Recent PDB entries, posterior to 2017 could not be used by our version of PPIDM (2017). Indeed, we determined that among the 3,859 DDIs not inferred by PPIDM in 3did 2020, 90.3% had a very limited number (Nb � 5) of PDB entries before 2017, including 39.1% of DDIs without any PDB entry in 2017 or before. By design, it was impossible for PPIDM to infer such DDIs.
Another way to evaluate the PPIDM set of inferred DDIs was to study the coverage of existing PPIs by these DDIs. For this purpose, we used a recent dataset (august 2020) of 50,032 non redundant curated PPIs from the IMEX database [53]. Only PPIs composed of proteins containing at least one Pfam domain were retained to allow coverage analysis by 3did 2020, PPIDM, DOMINE, and Negatome as a negative control. Results are presented in Fig 4A and reveal that PPIDM covers 13,316 PPIs from IMEx, i.e. about 4,000 PPIs more than DOMINE and almost 2.2 times higher than the number covered by 3did 2020.
To complement these results, we also computed the number of DDIs that cover at least one PPI of the IMEx dataset. The obtained results are shown in Fig 4B. The number of relevant  DDIs for PPI coverage in PPIDM is more than 3 times the number of relevant DDIs in DOMINE and 4.6 times higher than the number of relevant DDIs in 3did2020. Thus, the PPIDM dataset encompasses an interesting diversity of DDIs that deserves further studies.
Similar results were obtained at a larger scale using a subset of STRING containing 607,088 human well-scored PPIs (Fig 5). These results yield an update of the capacity of existing available sets of DDIs to cover large-scale sets of PPIs on the basis of their Pfam composition: about 24%, 35% and 30% of the STRING subset of PPIs are covered by DDIs from 3did2020, PPIDM and DOMINE, respectively (Fig 5A). Moreover, a total of 23,297 DDIs (i.e. 27.5% of the PPIDM dataset) are used to cover at least one PPI of the hSTRING-PPIs dataset (Fig 5B).
Finally we selected a few examples of newly inferred DDIs to study their possible biological interest. We selected the PPIDM (Gold) DDIs that cover at least one PPI from Imex and filtered out the DDIs present in 3did 2020 or DOMINE. There were 1,915 DDIs left. We extracted those DDIs covering more than 10 PPIs in IMEx (155 DDIs) and ranked them from higher to lower PPIDM consensus score. In Table 4, we present five examples selected on the basis of interpretable Pfam description among the top-15 DDIs.
Only the first DDI here is an homodimer. The concerned domain PF01352 is part of a zinc finger protein and is involved in protein-protein interactions. The observation of this inferred DDI suggests that at least some zinc finger proteins could act as dimers. The second DDI  Table 4. Description of selected Gold DDIs newly inferred by PPIDM and absent from 3did2020 or DOMINE. The PPIDM consensus score and the number of covered PPIs in the IMEx database are indicated. The domain short name is taken from the corresponding Pfam entry and followed by a short description adapted from the corresponding Pfam entry. in the table (PF11620-PF16517) involves two domains sharing the same SARAH region which is reported to mediate the formation of homo and hetero complexes. The third DDI (PF03931-PF12937) is between a domain present in SKP1 proteins and a domain (F-Box-like) known to mediate interaction with SKP1 proteins. Our fourth example (PF00622-PF13765) appears to concern two frequently associated domains PRY and SPRY, that are very similar.

Inferred
The SPRY domain is reported as a protein-interaction module but it remains to be demonstrated whether the two domains actually interact together. Finally, the last DDI in the  The large size of the PPIDM dataset contrasts with the more limited sizes of DOMINE or DIMA 3.0 datasets (around 26 and 50 thousands of DDIs respectively). However, other authors already got even larger datasets in the past. In [9], a dataset of 288,098 DDIs was inferred using two neighborhood cohesiveness coefficients and a combined proportion method on the STRING interactome. In that work as in ours, the purpose is not the same as with the parsimonious methods that aim to explain a maximum number of PPIs in an interactome with the minimum number of DDIs. Indeed, large repositories of DDIs should rather be understood as reservoirs from which possible DDIs can be picked up to support working hypotheses or to select candidate interactants for a protein of interest.

Discussion
As the DDIs inferred by PPIDM are qualified with a score and a p-value, many explorations are possible from now on with this dataset. For example it will be interesting to analyse the 800 DDIs inferred in 2017 that were subsequently found new in 3did between 2017 and 2020, in order to learn which features or annotations associated with these DDIs have favored their experimental confirmation in 3did.
Network properties have been computed using the Cytoscape software for PPIDM Gold and Silver subsets of DDI and are compared with the results obtained for 3did 2020 and DOMINE. Comparison is also performed with the two PPI networks used for PPIDM evaluation. Table 5 summarizes the comparison in terms of connected components and node degree distribution. All the networks considered here display the global organisation already reported in [40]: a giant component without any detectable subcluster and a series of smaller isolated components. However the percentage of nodes participating in the giant component is much higher in the PPI networks (more than 97%) than in the DDI networks (less than 75%). Consequently, DDI networks contain many more small connected components than PPI networks, including numerous components of size 1 corresponding to self-interacting domains, and components of size 2 corresponding to isolated DDIs. This difference in the distribution of connected components is consistent with the notion that PPIs can display much more interconnections than DDIs because proteins can interact with multiple partners via several domains.
It should be noted that the number of connected components of size 1 correspond to isolated domains that only interact with themselves. However self-interacting domains (or homo-DDIs) are also present in the other components. In total PPIDM counts 7,203 homo-DDIs of which exactly 3000 and 4160 are present in the Gold and Silver subsets respectively, corresponding to 79.6% and 45.2% of the nodes respectively. These percentages can be compared to those observed in 3did 2020 (70.6%), DOMINE (64.2%), or the one reported for UniDomInt (32.5%) in [40]. This clearly indicates that not all domains are capable of self interactions and that, in terms of network properties, the PPIDM (Gold) dataset is closer to the reference 3did 2020 than are the PPIDM (Silver) or DOMINE datasets.
The distribution of nodes according to their degree is also different between DDI and PPI networks. The general shape is right-skewed with a large number of nodes having small degrees and unique nodes with very high degrees (hubs). This confirms that none of these networks is random but, on the contrary, they represent complex sets of interactions. In the four DDI networks considered in this study, the maximum number of nodes is always obtained for degree 2 whereas in the two PPI networks it is for degree 1. The superior degree for 99% of the ranked nodes is rather low for DDI networks (17, 27 and 30 for 3did 2020, PPIDM (Gold) and PPIDM (Silver) respectively), except for DOMINE (98). This superior degree is very high for hSTRING (431) and intermediate for IMEX (78). The nodes with the maximal degree are also indicated in Table 5 and reach very high values for PPI networks. In DDI networks, these outliers indeed correspond to domains responsible for interactions in a large variety of proteins. Three of them (PF00018, PF00069 and PF00096) belong to the list of outliers observed in Uni-DomInt [40].
A limit of this study is certainly that the PPIDM dataset has been produced on 2017 releases of available PPI resources. While it is always possible to update the dataset on more recent

PLOS COMPUTATIONAL BIOLOGY
releases, it seemed relevant to us to explore the predictive value of these results derived from previous sources in the light of new data (PPIs or DDIs) obtained subsequently. Another limit lies in the fact that the PPI sources used in this study are overlapping. For example, STRING imports its experimental data from IMEx, BioGRID and PDB, among other sources and IMEx is made up of IntAct, MINT and DIP. Thus, a non negligeable proportion of PPIs have been reused several times to infer DDIs. Nevertheless, it should be noted that the CODAC process computes similarity scores and p-values for each PPI source separately. These separated values are available in the datasets proposed for download at http://ppidm. loria.fr. Then, data reuse is only effective (and somewhat inevitable) when merging scores, as in the case of other combined DDI datasets such as DOMINE. Another study would be to estimate the impact of this reuse of PPIs on the number or quality of DDIs recovered. Reuse of PPIs also occurs at the evaluation step with IMEx and a subset of STRING. However, this evaluation step does not claim to be similar to the validation of a predictive model in machine learning. It is more akin to evaluating the quality of a corpus produced by an information retrieval procedure. Indeed, the main outcome of our work is to produce a large reservoir of DDIs available for further studies.
Just like all other studies dealing with DDIs, PPIDM encounters by design a limitation in explaining PPIs that do not interact through well-defined conserved domains but rather through intrinsically disordered regions [56]. This is likely the reason why a certain percentage of the interactome cannot be explained by inferred DDIs even with a large dataset such as PPIDM.
With the tripartite graph paradigm, our PPIDM inference method is related to link prediction methods in multipartite or knowledge graphs. Our generic graph formulation of link prediction has been found to be useful for modeling the multi-source approach. In addition, the PPIDM setting makes it possible to infer interactions with domains subsets in addition to single domain interactions. This is not the case for most other DDI-inferring methods, except for the discriminative feature selection method described in [30]. For PPIDM, we just need to consider subsets of domain as elements of the two sets X and Y in the tripartite framework and define their relation to PPIs in Z. This extension of PPIDM is quite original and could provide clues in the analysis of multi-domain protein interactions. Indeed, while a small number of single domain proteins interact with their biological associates directly, a much larger number of proteins have more than one domain [57], and interactions between these multi-domain proteins can often involve two or more domains [58].

Conclusion
This paper has introduced and presented a new approach named PPIDM for inferring DDIs from multiple sources of PPIs. The obtained dataset displays interesting properties including a good scoring of experimental DDIs and a significant coverage of existing PPIs. Further studies will be devoted to better describe the inferred DDIs and possibly validate them through largescale experimental studies. It could be envisaged for instance to look for correspondences between PPIDM DDIs and cross-linked peptides obtained with cross-linked mass spectrometry, as already performed for PPIs in Drosophila embryos [59]. We believe that the large set of inferred DDIs produced in this study can be used to interpret interactome data and provide added-value to protein network analyses.