Protein complex prediction via dense subgraphs and false positive analysis

Many proteins work together with others in groups called complexes in order to achieve a specific function. Discovering protein complexes is important for understanding biological processes and predicting protein functions in living organisms. Large-scale, high-throughput techniques have made it possible to compile protein-protein interaction networks (PPI networks), which have been used in several computational approaches for detecting protein complexes. Those predictions might guide future biological experimental research. Some approaches are topology-based, where highly connected proteins are predicted to be complexes; some propose different clustering algorithms, based on partitioning or on overlapping clusters, for networks modeled as unweighted or weighted graphs; and others use cluster density together with information on protein functionality. However, some schemes still require long processing times, or the quality of their results can be improved. Furthermore, most of the results obtained with computational tools are not accompanied by an analysis of false positives. We propose an effective and efficient mining algorithm for discovering highly connected subgraphs, which is our basis for defining protein complexes. Our representation is based on transforming the PPI network into a directed acyclic graph that reduces the number of represented edges and the search space for discovering subgraphs. Our approach considers weighted and unweighted PPI networks. We compare our best alternative with state-of-the-art approaches on PPI networks from Saccharomyces cerevisiae (yeast) and Homo sapiens (human), in terms of clustering metrics, biological metrics and execution times, using three gold standards for yeast and two for human. Furthermore, we analyze false positive predicted complexes by searching the PDBe (Protein Data Bank in Europe) database in order to identify matching protein complexes that have been purified and structurally characterized.
Our analysis shows that more than 50 yeast protein complexes and more than 300 human protein complexes found to be false positives according to our prediction method, i.e., not described in the gold standard complex databases, in fact contain protein complexes that have been characterized structurally and documented in PDBe. We also found that some of these protein complexes have recently been classified as part of a Periodic Table of Protein Complexes. The latest version of our software is publicly available at http://doi.org/10.6084/m9.figshare.5297314.v1.

describes the main mining heuristic, which is based on finding at most one dense subgraph starting at each node in DAPG.
The core of our mining technique is Table A2. It starts at each node v in DAPG and walks back along the path, from each node to its predecessor, up to a root. Along the path, we maintain in set S the intersection of the vertexSet of a subset of the visited nodes (those which provide a better partial DSG), while we maintain in set C the labels of the nodes of the selected subset. Note that, at each point, (S ∪ C, S × C) is indeed a valid graph. From all those DSGs, we retain only the "best one". We determine the "best DSG" using an objective function (f_obj), which is a configuration parameter.
        nodeDsg ← GetDenseSubgraphFrom(DAPG, node)
        if nodeDsg is maximal w.r.t. DSGs then
            DSGs ← DSGs − {dsg ∈ DSGs : dsg ⊂ nodeDsg}
            DSGs ← DSGs ∪ {nodeDsg}
        end if
    end for
    return DSGs
Table A2: Detection of a DSG starting at a given node in DAPG.
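The maximality check in the pseudocode fragment above can be sketched in Python as follows. This is a minimal illustration, not the implementation distributed with the software. It assumes that each DSG is represented as a frozenset of vertex labels, that `get_dense_subgraph_from` is a hypothetical stand-in for GetDenseSubgraphFrom passed in as a callable, and that "maximal w.r.t. DSGs" means "not contained in any DSG collected so far".

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List, Optional

Dsg = FrozenSet[Hashable]


def collect_maximal_dsgs(
    nodes: Iterable[Hashable],
    get_dense_subgraph_from: Callable[[Hashable], Optional[Dsg]],
) -> List[Dsg]:
    """Collect at most one DSG per node, keeping only maximal ones.

    `get_dense_subgraph_from` is a hypothetical callable standing in for
    GetDenseSubgraphFrom; it returns a DSG or None for a given node.
    """
    dsgs: List[Dsg] = []
    for node in nodes:
        node_dsg = get_dense_subgraph_from(node)
        if node_dsg is None:
            continue  # no dense subgraph was found from this node
        # Assumed reading of "maximal w.r.t. DSGs": skip the candidate
        # if a DSG collected earlier already contains it.
        if any(node_dsg <= dsg for dsg in dsgs):
            continue
        # Remove previously collected DSGs that are proper subsets.
        dsgs = [dsg for dsg in dsgs if not dsg < node_dsg]
        dsgs.append(node_dsg)
    return dsgs
```

Because a candidate is only added when no collected DSG contains it, and strict subsets are removed on insertion, the returned list never holds two DSGs where one contains the other.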

Weighted and unweighted DAPG
In this section we present all the results we obtained in terms of clustering metrics using the unified version of DAPG on weighted and unweighted PPI networks. We report results for the different orders, merge options and objective functions of the mining algorithm explained in the main manuscript. For the small yeast PPI networks, Tables A4, A6, A8, A10 show the results using f_obj = |S ∩ C|, and Tables A5, A7, A9, A11 show the results using f_obj based on weighted density definitions. Similarly, for the large yeast and human PPI networks, Tables A12, A14, A16 show the results with f_obj = |S ∩ C|, and Tables A13, A15, A17 show the results with f_obj = W DEGREE, which measures performance using weighted density metrics.
In all experiments we used the yeast PPI networks and the CYC2008 reference [1] provided with the ClusterONE software distribution for yeast, and PCDq for human. All experiments were performed with total order function φ

Methods used for comparison
In order to evaluate DAPG in detecting protein complexes, we used the following state-of-the-art methods: ClusterONE [2], MCL [3], CFinder [4], and GMFTP [5]. The performance of each method depends on its parameter settings and on the reference (gold standard) of protein complexes used as ground truth. Therefore, we first describe the main features of each algorithm and report the parameter tuning performed using the CYC2008 reference [1]. For each method we selected the parameters that achieved the best MMR (Maximum Matching Ratio), a metric proposed by the ClusterONE authors, and we measured clustering metrics with their implementation, available at http://www.paccanarolab.org/static_content/clusterone/additional_information.html. All experiments report the parameters and the clustering metrics FMeasure, Acc and MMR, as well as the execution time in seconds.
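To make the tuning criterion concrete, the following Python sketch computes MMR as we understand its published definition: predicted and reference complexes form a bipartite graph whose edge weights are the overlap score |A ∩ B|^2 / (|A| |B|), and MMR is the total weight of a maximum-weight matching divided by the number of reference complexes. The function names are hypothetical and the exhaustive matching is only practical for small inputs; this is not the ClusterONE evaluation script used for the reported results.

```python
from functools import lru_cache
from typing import FrozenSet, Sequence

Complex = FrozenSet[str]


def overlap_score(a: Complex, b: Complex) -> float:
    """Overlap between two complexes: |A ∩ B|^2 / (|A| * |B|)."""
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def maximum_matching_ratio(
    predicted: Sequence[Complex], reference: Sequence[Complex]
) -> float:
    """Exact MMR by exhaustive search over one-to-one matchings.

    Illustrative only: the running time grows exponentially with the
    number of reference complexes.
    """
    if not reference:
        return 0.0

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        # Best total weight for predicted[i:], given the bitmask `used`
        # of reference complexes that are already matched.
        if i == len(predicted):
            return 0.0
        total = best(i + 1, used)  # leave predicted[i] unmatched
        for j, ref in enumerate(reference):
            if not used & (1 << j):
                weight = overlap_score(predicted[i], ref)
                if weight > 0.0:
                    total = max(total, weight + best(i + 1, used | (1 << j)))
        return total

    return best(0, 0) / len(reference)
```

Because each reference complex can be matched to at most one predicted complex, a method cannot raise its MMR by reporting many near-duplicate clusters that all overlap the same reference complex.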

ClusterONE
ClusterONE detects overlapping protein complexes from weighted and unweighted PPI networks, and it is based on overlapping neighborhood expansion. The main parameter of ClusterONE is d, the minimum density of clusters; we keep the other parameters at their default values, as in previous work [5]. Tables A18, A19, A20, A21, and A24 present the results for Collins, KroganCore, KroganExt, Gavin and Biogrid.
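As an illustration of what a minimum density threshold d does, the sketch below filters clusters by density. It assumes the usual definition of cluster density, namely the total weight of the internal edges divided by the number of possible edges n(n-1)/2, with unweighted edges counted as weight 1. The function names and the edge representation are hypothetical and are not taken from the ClusterONE code.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set

Edge = FrozenSet[str]


def cluster_density(cluster: Set[str], weights: Dict[Edge, float]) -> float:
    """Internal edge weight divided by the number of possible edges."""
    n = len(cluster)
    if n < 2:
        return 0.0
    internal = sum(
        weights.get(frozenset(pair), 0.0)
        for pair in combinations(sorted(cluster), 2)
    )
    return internal / (n * (n - 1) / 2)


def filter_by_density(
    clusters: Iterable[Set[str]], weights: Dict[Edge, float], d: float
) -> List[Set[str]]:
    """Keep only the clusters whose density is at least d."""
    return [c for c in clusters if cluster_density(c, weights) >= d]
```

Raising d in this sketch keeps fewer but more tightly connected clusters, which is the trade-off explored when tuning the parameter.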

MCL
MCL detects clusters by simulating random walks on the input graph, modeled as a Markov chain, to discover where the flow concentrates and forms clusters. Its key parameter is the inflation (I), which tunes the granularity of the clusters. We executed MCL with different inflation values on all input PPI networks and provide the results in Tables A25, A26, A27, A28 and A29.
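The role of the inflation parameter can be seen in a minimal pure-Python sketch of the MCL iteration, which alternates expansion (squaring the column-stochastic matrix) and inflation (raising entries to the power I and renormalizing columns). This is a didactic sketch on dense lists, assuming self-loops are added before normalization; it is not the MCL program used in our experiments, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def normalize_columns(m: Matrix) -> Matrix:
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m: Matrix) -> Matrix:
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m: Matrix, inflation: float) -> Matrix:
    """Inflation: raise entries to `inflation`, then renormalize columns."""
    return normalize_columns([[x ** inflation for x in row] for row in m])


def mcl(adjacency: Matrix, inflation: float = 2.0, iterations: int = 30) -> Matrix:
    """Run a fixed number of expansion/inflation rounds."""
    n = len(adjacency)
    # Self-loops are added here as an assumption of this sketch.
    with_loops = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    m = normalize_columns(with_loops)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    return m


def clusters_from(m: Matrix, eps: float = 1e-6) -> List[List[int]]:
    """Read clusters from rows that still carry flow (entries above eps)."""
    found = {tuple(j for j, x in enumerate(row) if x > eps) for row in m}
    return sorted(list(c) for c in found if c)
```

Inflation strengthens the larger entries of each column at the expense of the smaller ones, so higher values of I tend to split the graph into more, smaller clusters.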

CFinder
CFinder is based on the Clique Percolation Method (CPM) [6] to detect overlapping modules in biological networks. CPM builds communities from k-cliques, where a community is defined as the maximal union of k-cliques that are connected through a series of adjacent k-cliques. The key parameters in CFinder are k (the clique size in k-clique) and t, the time in seconds allowed for searching for a clique from a node in the graph. We obtained our best results with k = 3 for Collins and KroganCore, k = 4 for KroganExtended and Gavin, and k = 6 for Biogrid. For the t parameter, we found that t = 1 and t = 10 gave the same results on Collins, KroganCore, KroganExtended and Gavin, so we report execution times for t = 1. For Biogrid, execution with t = 10 took more than 2 days, so we used t = 1.
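The community definition used by CPM can be sketched in a few lines of Python. The sketch assumes the standard CPM notion of adjacency, in which two k-cliques are adjacent when they share k-1 nodes, and it enumerates k-cliques by brute force, so it only suits small graphs. It is an illustration with hypothetical names, not the CFinder implementation.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def k_clique_communities(
    edges: Iterable[Tuple[str, str]], k: int
) -> List[Set[str]]:
    """Unions of k-cliques reachable through adjacent k-cliques."""
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    # Brute-force enumeration of all k-cliques (small graphs only).
    cliques = [
        frozenset(c)
        for c in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]

    # Union-find over cliques: merge cliques that share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    communities: Dict[int, Set[str]] = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return list(communities.values())
```

With k = 3, two triangles that share an edge merge into one community, whereas two triangles that share only a single node remain separate communities that overlap in that node.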

GMFTP
GMFTP is based on a generative model with functional and topological properties. It tends to predict protein complexes formed by groups of proteins that frequently interact with each other and have similar functional patterns. The method transforms the detection problem into a parameter estimation problem. The objective function in GMFTP is not convex, so the multiplicative updating rules of the algorithm do not necessarily converge to the global minimum. As a result, the method cannot guarantee that the final estimator is the globally optimal solution, and the result is not deterministic. The method addresses this issue with a parameter, repeat times, that sets how many times the entire calculation is repeated. By default this parameter is set to 100.
In our experiments, when trying to execute GMFTP on all PPI networks, we found that with the default parameters it was impossible to get results within a day of execution time. Therefore, we left all parameters at their defaults, except repeat times, which we set to 10 instead of 100. Doing this we were able to get results in a little more than 12 hours of execution with

DCAFP
DCAFP is a method that predicts protein complexes based on two main properties. The first is that complexes are densely connected groups of proteins in the PPI network; the second is that proteins in the same complex are similar in at least some specific subsets of functional GO categories, according to the functional information given in the Gene Ontology. DCAFP has the parameters minsize attributes, delta, wmin, osmax and maxloops. The most relevant are wmin, which has the greatest impact on the size of the clusters found, and osmax, which matters most for performance. We varied these two parameters between 0.2 and 1.0, keeping the others at their defaults, to obtain our results.

RNSC
RNSC is a stochastic algorithm based on a search metaheuristic that optimizes a partition of the network into clusters according to a cost function. The algorithm has several parameters, such as the tabu length, number of experiments, diversification length, and diversification frequency. We ran RNSC with default parameters.

MCODE
MCODE is one of the earliest algorithms for protein complex prediction. We used a command-line application for the Linux platform to run the experiments. The method has several parameters; among the most important are the neighborhood density percentage, which varies from 0 to 1.0, and the maxdepth parameter, which we set to 1000 and 10000. We left the other parameters at their default values.

SPICI
SPICI can be run from its web site, and the software is also available for download at http://compbio.cs.princeton.edu/spici/ . The method ranks nodes by weighted degree and builds clusters greedily, starting at seed nodes in order of decreasing degree. Clusters grow by successively adding neighbors of the seed vertices that increase the cluster's density. We tried different values of the minimum density, including the default of 0.5, and different values of the minimum support threshold, varying both parameters between 0.2 and 0.8. We also set the sparsity parameter (-m) to each of its possible values: 0, 1, and 2.

COREPEEL
COREPEEL is a method that predicts protein complexes in polynomial running time and works well on large PPI networks. It can be run at http://bioalgo.iit.cnr.it/. The approach is based on finding dense communities in the form of quasi-cliques. The method has two basic steps. The first applies a core decomposition of the graph, which provides, for each vertex, a tight upper bound on the size of the largest quasi-clique that includes that vertex. The second discards (peels out) loosely connected vertices from the quasi-cliques. The method has several parameters, such as the minimum density, maximum size, subgraph min size, filter type (strict, medium, loose) and maximum Jaccard separation. We tried minimum densities between 50 and 100, maximum Jaccard separations between 0.5 and 1.0, and all three filter types on all PPI networks.
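The first step can be illustrated with the standard k-core decomposition, sketched below in Python under the assumption that this is what "core decomposition" refers to. The sketch repeatedly removes a vertex of minimum remaining degree and records the largest such degree seen so far as the core number. For exact cliques, any clique containing a vertex v has at most core(v) + 1 vertices; the corresponding bound for quasi-cliques used by COREPEEL is not reproduced here. The function name and the adjacency-dictionary input are hypothetical, and this quadratic-time version is meant for clarity rather than speed.

```python
from typing import Dict, Hashable, Set


def core_numbers(adj: Dict[Hashable, Set[Hashable]]) -> Dict[Hashable, int]:
    """Core number of every vertex of an undirected graph.

    `adj` maps each vertex to the set of its neighbors and is assumed
    to be symmetric and free of self-loops.
    """
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = set(adj)
    core: Dict[Hashable, int] = {}
    k = 0
    while remaining:
        # Peel a vertex with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core
```

Vertices with a low core number cannot belong to any large clique, which is why a core decomposition is a cheap way to prune the search space before looking for dense subgraphs.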

False positive predicted protein complexes
Predicted protein complexes considered false positives, i.e., protein complexes that are absent from the gold standards, are analyzed using the information stored in PDBe, which contains protein complexes that have been characterized structurally. Many of these PDB ids are present in the Periodic Table of Protein Complexes [7]. We report the complete lists of these candidate complexes in files with the extension .csv. Additionally, we report the predicted protein complexes found to be false positives that include these candidate protein complexes, in files with the extension .xml. Both types of files are in the results directory of the software distribution developed for our approach (http://doi.org/10.6084/m9.figshare.5297314.v1).