Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks

doi:10.1371/journal.pone.0262056

Fig 1.

Different topologies are exhibited by human protein complexes.

a. Clique (Commander/CCC complex), b. Hybrid with different edge weights (BLOC-1, biogenesis of lysosome-related organelles complex 1), c. Hybrid (NRD complex, nucleosome remodeling and deacetylation complex), d. Linear (Ubiquitin E3 ligase (CUL3, KLHL9, KLHL13, RBX1)). These are experimentally characterized complexes from CORUM [3] with protein interaction evidence obtained from hu.MAP [4].

Fig 2.

Super.Complex identifies likely protein complexes within a PPI network using a distributed supervised AutoML method.

Task 1: Learning a community fitness function. (i) Topological feature extraction: Topological features are extracted from known communities to build community embeddings (feature vectors that represent communities in vector space). (ii) Supervised learning with AutoML: A score function for communities, the community fitness function, is learned from the community embeddings as the decision function for binary classification of a network subgraph as a community or a random walk (illustration on the right). The best score function is selected after training multiple machine learning models with TPOT [18], an AutoML pipeline. Task 2: Searching for candidate communities in the network. (iii) Intelligent sampling: Multiple communities are sampled in parallel from the network. To build each candidate community, a seed edge is selected and grown using a 2-stage heuristic. First, an epsilon-greedy heuristic selects a candidate neighbor; then a pseudo-metropolis (constant probability) or iterative simulated annealing heuristic accepts or rejects the candidate neighbor for growing the current community. An iteration of neighbor selection using a greedy heuristic is shown (illustration on the left), starting from a seed edge {F, I}. The edge is grown to the subgraph {F, I, E} because adding node E yields a higher community fitness score than adding any other neighbor of F and I. The seed edge {B, C} is grown in parallel (not shown). (iv) Merging overlaps: The candidate communities are merged such that the maximum overlap between any 2 communities does not exceed a specified threshold.
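The sampling and merging steps in Task 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Super.Complex implementation: the learned community fitness function is replaced by a stand-in (`density`, the edge density of the subgraph), the acceptance rule is one plausible reading of the constant-probability heuristic, and the merge step simply takes the union of any pair whose Jaccard overlap exceeds the threshold. All function and parameter names (`grow_community`, `merge_overlaps`, `epsilon`, `accept_prob`, `max_size`) are hypothetical.

```python
import random

def density(nodes, adj):
    """Stand-in fitness (assumption): fraction of possible edges present."""
    nodes = list(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(1 for i in range(n) for j in range(i + 1, n)
                if nodes[j] in adj[nodes[i]])
    return edges / (n * (n - 1) / 2)

def grow_community(seed_edge, adj, fitness=density, epsilon=0.1,
                   accept_prob=0.1, max_size=10, rng=None):
    """Grow a seed edge into a candidate community with a 2-stage heuristic."""
    rng = rng or random.Random(0)
    community = set(seed_edge)
    score = fitness(community, adj)
    while len(community) < max_size:
        neighbors = sorted(set().union(*(adj[u] for u in community)) - community)
        if not neighbors:
            break
        # Stage 1: epsilon-greedy choice of a candidate neighbor.
        if rng.random() < epsilon:
            candidate = rng.choice(neighbors)
        else:
            candidate = max(neighbors,
                            key=lambda v: fitness(community | {v}, adj))
        new_score = fitness(community | {candidate}, adj)
        # Stage 2 (assumed rule): accept if the score does not drop,
        # otherwise accept with a constant probability; else stop growing.
        if new_score >= score or rng.random() < accept_prob:
            community.add(candidate)
            score = new_score
        else:
            break
    return community

def jaccard(a, b):
    """Jaccard coefficient of two node sets."""
    return len(a & b) / len(a | b)

def merge_overlaps(communities, threshold):
    """Union pairs until no pair has Jaccard overlap above the threshold."""
    merged = [set(c) for c in communities]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i], merged[j]) > threshold:
                    merged[i] = merged[i] | merged.pop(j)
                    changed = True
                    break
            if changed:
                break
    return merged

# Toy network: triangle {F, I, E} plus a pendant node G attached to F.
adj = {"F": {"I", "E", "G"}, "I": {"F", "E"}, "E": {"F", "I"}, "G": {"F"}}
grown = grow_community(("F", "I"), adj, epsilon=0.0, accept_prob=0.0)
```

With `epsilon=0.0` and `accept_prob=0.0` the procedure is purely greedy, so on this toy network the seed edge {F, I} grows to {F, I, E} and then stops, because adding G would lower the stand-in fitness.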

Fig 3.

The proposed evaluation measures FMMF, CMFF, and UnSPA are sensitive metrics.

a. Bipartite graph, where each edge weight corresponds to the F-similarity (sim_F(C_k, C_l)) between C_k, a known community from K, the set of known communities, and C_l, a learned community from L, the set of learned communities. b. The F-similarity score combines precision (P(C_k, C_l)) and recall (R(C_k, C_l)) measures, computed as fractions of the number of common nodes w.r.t. the number of nodes in a community. |C| is the number of nodes in community C and |C₁∩C₂| is the number of nodes common to both communities. c. F-similarity-based Maximal Matching F-score (FMMF) combines precision (P_FFM) and recall (R_FFM) measures computed for a maximal matching, M, of the bipartite graph in Fig 3A. d. Community-wise Maximum F-similarity based F-score (CMFF) combines precision (P_CMF) and recall (R_CMF) measures, averaging over the maximum F-similarity score for a community in a particular set (e.g. known communities) w.r.t. a community of the other set (e.g. learned communities). e. UnSPA is an unbiased version of Sn-PPV accuracy (SPA), computed as the geometric mean of unbiased PPV (PPV_u) and unbiased Sensitivity (Sn_u). These are computed similarly to the precision and recall measures in CMFF, except that precision and recall similarity scores, respectively, are used instead of the F-similarity score. f. The sensitivity of different evaluation measures w.r.t. the overlap between communities (maximum pairwise Jaccard coefficient) shows that FMMF, CMFF, UnSPA, and the existing measures Qi et al. F1 score (Eq 8 in S1 File) and SPA (Eq 9 in S1 File) are sensitive metrics, with FMMF, CMFF, and Qi et al. F1 score following the desired trend. Here, each data point on the plot corresponds to a measure evaluating an individual run of Super.Complex’s merging algorithm with the maximum Jaccard overlap threshold set to the x-axis value.
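The measures in panels b to d can be illustrated with a short Python sketch. Several details are assumptions rather than definitions taken from the caption: precision and recall are combined by the harmonic mean (the usual F-score), precision is normalized by the size of the learned community and recall by the size of the known community, and the maximal matching for FMMF is built greedily by descending edge weight. The function names are hypothetical.

```python
def f_similarity(known, learned):
    """F-similarity of two node sets (harmonic mean is an assumption)."""
    common = len(known & learned)
    if common == 0:
        return 0.0
    precision = common / len(learned)  # common nodes w.r.t. learned size
    recall = common / len(known)       # common nodes w.r.t. known size
    return 2 * precision * recall / (precision + recall)

def harmonic_mean(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def cmff(known_set, learned_set):
    """Community-wise maximum F-similarity based F-score."""
    precision = sum(max(f_similarity(k, l) for k in known_set)
                    for l in learned_set) / len(learned_set)
    recall = sum(max(f_similarity(k, l) for l in learned_set)
                 for k in known_set) / len(known_set)
    return harmonic_mean(precision, recall)

def fmmf(known_set, learned_set):
    """F-score over a greedily built maximal matching (assumed strategy)."""
    edges = sorted(((f_similarity(k, l), i, j)
                    for i, k in enumerate(known_set)
                    for j, l in enumerate(learned_set)), reverse=True)
    used_known, used_learned, total = set(), set(), 0.0
    for weight, i, j in edges:
        if weight > 0 and i not in used_known and j not in used_learned:
            used_known.add(i)
            used_learned.add(j)
            total += weight
    precision = total / len(learned_set)
    recall = total / len(known_set)
    return harmonic_mean(precision, recall)

# Toy example: two known communities, three learned communities.
known = [{"A", "B", "C"}, {"D", "E"}]
learned = [{"A", "B", "C"}, {"D", "E", "F"}, {"X", "Y"}]
cmff_score = cmff(known, learned)
fmmf_score = fmmf(known, learned)
```

In this toy example the first known community is recovered exactly (F-similarity 1.0), the second is recovered with one extra node (F-similarity 0.8), and the third learned community matches nothing, so precision is 0.6, recall is 0.9, and both scores come out at 0.72.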

Fig 4.

Human protein complexes learned with Super.Complex achieve good PR curves and follow size distributions similar to those of known complexes.

a. PR curve for the best model (community fitness function) from the AutoML pipeline on the test dataset, for the task of classifying a subgraph as a community or not. b. Co-complex edge classification PR curve for final learned complexes. c & d. Best F-similarity score distributions per known complex and per learned complex. e. The size distributions of train, test, and all known complexes, learned complexes, and learned complexes after removing known complex proteins.

Table 1.

Best parameters found and used in each of the experiments.

Table 2.

Evaluating learned complexes on hu.MAP w.r.t. ‘refined CORUM’ complexes.

Table 3.

Comparing our method with 6 supervised and 4 unsupervised methods on a yeast PPI network.

Fig 5.

Examples of complexes with proteins having low annotation scores.

a. C11orf42 constitutes the Retromer complex (SNX1, SNX2, VPS35, VPS29, VPS26A), potentially related to trafficking, with C11orf42 localized in cells to vesicles, similar to the other proteins of the complex (SNX1, SNX5, and VPS29). b. C16orf91 constitutes the COX20-C16orf91-UQCC1 complex, potentially localized to mitochondria like COX20. c. C18orf21 constitutes the Rnase/Mrp complex, with C18orf21, localized to nucleoli, closely interacting with nucleoplasm proteins of the complex such as RPP25, POP5, RPP14, NEPRO, RPP30, IBTK, RPP25L, and NPM1. The images of subcellular localization are available from v20.1 of proteinatlas.org, as https://v20.proteinatlas.org/ENSG00000*/cell, where * is 180878-C11orf42, 028528-SNX1, 089006-SNX5, 111237-VPS29, 167272-POP5, 163608-NEPRO, 148688-RPP30, and 181163-NPM1. Note that, across the highlighted proteins, localizations were measured in varying cell types, including HeLa, HEL, U2OS, and U-251 MG cells.
