Using deep maxout neural networks to improve the accuracy of function prediction from protein interaction networks

Protein-protein interaction network data provide valuable information linking genes to their biological roles. This information underpins a fundamental hypothesis for protein function prediction: interacting proteins tend to have similar functions. With the help of recently developed network embedding feature generation methods and deep maxout neural networks, it is possible to extract functional representations that encode the links between protein-protein interactions and protein function. Our novel method, STRING2GO, successfully adopts deep maxout neural networks to learn functional representations that simultaneously encode both protein-protein interaction and functional predictive information. The experimental results show that STRING2GO outperforms other protein-protein interaction network-based prediction methods, as well as one benchmark method adopted in a recent large-scale protein function prediction competition.

The realisation of the complex relationships between genotypes and phenotypes has been fostering the collection and analysis of genome-wide datasets of molecular interactions detected from patterns of physical binding, transcript co-expression, mutant phenotypes, etc. Many specialised databases exist to store and integrate such heterogeneous data at different levels of biological complexity. At one end of the scale, the IMEx consortium gathers non-redundant protein-protein interactions (PPIs) from peer-reviewed scientific publications, and provides manually curated details about the experimental conditions [1]. At the opposite end, several resources extend these primary data with indirect or predicted associations to paint a more complete picture for whole organisms [2-5]. For instance, STRING [5] considers experimentally detected PPIs alongside other evidence of molecular interactions. This modelling approach can be easily exploited using the "guilt-by-association" principle: if the edges reflect biological facts reliably, adjacent nodes have more similar functions than those further away in the network, e.g. because they form a macromolecular complex, or because their activities are coordinated in a specific biological process.

The earliest methods therefore transfer annotations from nodes that are either adjacent or within close distance, possibly taking into account the enrichment of the functional labels [8]. Because the network topology is far from uniform and different functions arise from unevenly sized gene sets, using one particular distance or number of neighbours inevitably affects prediction accuracy. More sophisticated algorithms therefore try to group the nodes into functional modules or communities, each associated with a given function, and then make annotation transfers within them [9-14].
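The "guilt-by-association" transfer from adjacent nodes can be sketched as simple neighbour voting. This is a minimal illustration with hypothetical protein and GO identifiers, not the exact algorithms cited above:

```python
from collections import Counter

def predict_by_neighbours(graph, annotations, protein):
    """Rank GO terms by how many direct neighbours carry them."""
    votes = Counter()
    for neighbour in graph.get(protein, []):
        votes.update(annotations.get(neighbour, []))
    return [term for term, _ in votes.most_common()]

# Toy PPI network (adjacency list) and toy annotations.
graph = {"P1": ["P2", "P3", "P4"]}
annotations = {"P2": ["GO:A", "GO:B"], "P3": ["GO:A"], "P4": ["GO:C"]}
print(predict_by_neighbours(graph, annotations, "P1"))
# "GO:A" ranks first, carried by two of the three neighbours
```

More refined variants weight the votes by label enrichment or restrict them to a functional module, as the methods above do.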
The preliminary identification of functionally coherent subgraphs, however, poses additional challenges, which can make module-assisted predictors less accurate than those based on neighbour counting [15]. More recently, network propagation methods have become increasingly popular for addressing a wide range of problems [16]. They broadcast annotations from labelled proteins to others by running random walks, which visit the nodes in the network randomly until stopping criteria are met [17-19]. If the edges are weighted, this information controls the probability of traversing them; otherwise equal probabilities are used. Because the propagation is affected by node degree and edge weights, this approach reduces the chance of erroneous predictions being transferred from highly multifunctional hub proteins to adjacent nodes, which perform fewer functions. Alternatively, the transition probabilities can be used to encode the nodes directly as multi-dimensional features, and thus to make functional annotations with nearest neighbour strategies [20,21]. Cho et al. (2016) [22] and Gligorijević et al. (2018) [23] have instead used them to embed the STRING networks jointly, that is, to map nodes to continuous features which best explain the transition probabilities and the graph topology. The usefulness of the resulting features has been demonstrated for the task of protein function prediction.

This study proposes a novel PPI network-based protein function prediction method, STRING2GO. It adopts deep maxout neural networks to learn a novel type of functional biological network representation that simultaneously encapsulates both node neighbourhood and function co-occurrence information.
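The weighted random walks underlying the propagation and embedding methods above can be sketched minimally. Node names and weights here are hypothetical, and this is not STRING's or any cited method's actual procedure:

```python
import random

def transition_probs(weighted_neighbours):
    """Turn edge weights into traversal probabilities; equal weights
    reduce to the uniform case described in the text."""
    total = sum(weighted_neighbours.values())
    return {n: w / total for n, w in weighted_neighbours.items()}

def random_walk(graph, start, length, rng=None):
    """One walk of `length` steps over a weighted graph."""
    rng = rng or random.Random(0)
    path = [start]
    for _ in range(length):
        probs = transition_probs(graph[path[-1]])
        path.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return path

# Toy weighted network: from "A", the walk is three times more likely
# to step to "C" than to "B".
graph = {"A": {"B": 1.0, "C": 3.0}, "B": {"A": 1.0}, "C": {"A": 1.0}}
walk = random_walk(graph, "A", 4)
```

Propagation methods accumulate visit statistics over many such walks, so that annotations diffuse preferentially along heavy edges.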
These higher-level representations are learnt in a supervised way by training deep maxout neural networks to output all the terms in the biological process domain associated with an input protein, an approach that has led to higher predictive accuracy in the past [24,25].

Firstly, human proteins were retrieved from the UniProtKB/SwissProt release 2017_05 [26], while the corresponding protein-protein interaction information was retrieved from STRING v10.0 [27], which includes seven component networks from heterogeneous data sources and one integrated network. The mapping between UniProtKB/SwissProt accession numbers and the Ensembl protein identifiers adopted in STRING was obtained by using the Biomart tool [28].
Protein annotations were propagated over "is_a" relationships in the Gene Ontology database [30] (GO obo file release 2017-04-28). To assure the feasibility of the subsequent machine-learning experiments, only biological process (BP) terms annotating at least 100 proteins were initially considered. To guarantee that the predictions are sufficiently specific and informative, this list was subsequently filtered so that only the deepest terms in the ontology were retained, i.e. two terms a and b were both kept if and only if there is no "is_a" path from a to b nor from b to a. These steps yielded a vocabulary consisting of 204 BP terms (detailed information is included in Table S1).
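The deepest-term filter can be sketched as follows. This is a minimal illustration with a hypothetical mini-ontology; `parents` maps each term to its direct "is_a" parents:

```python
def ancestors(term, parents):
    """All terms reachable from `term` via is_a edges."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def deepest_terms(candidates, parents):
    """Keep only candidates that are not ancestors of another candidate,
    i.e. the pairwise condition on is_a paths stated above."""
    covered = set()
    for term in candidates:
        covered |= ancestors(term, parents)
    return [term for term in candidates if term not in covered]

# Hypothetical mini-ontology: c is_a b, b is_a a.
parents = {"c": ["b"], "b": ["a"]}
print(deepest_terms(["a", "b", "c"], parents))  # only "c" survives
```

Applied to the real GO graph, this keeps only terms with no "is_a" path between any retained pair.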

The set of human proteins was split into a large subset for GO term-specific classifier training and a small subset for held-out evaluation. 10,667 proteins with at least one cellular component term were initially selected from the whole set. Out of these, 1,000 proteins were randomly selected for held-out evaluation from the subset of well-annotated entries, i.e. those with at least 28, 5 and 14 experimental or electronic biological process, molecular function and cellular component terms respectively. After removing electronic annotations, the held-out set for BP terms contains 982 proteins, while the large set contains 5,000 proteins. In addition, we also created a separate protein set for a temporal annotation validation by selecting 428 proteins that had no experimental annotation with any of the 204 BP terms but received at least one after 6 months. The source files were collected from UniProtKB/SwissProt release 2017_11.

Predictive performance evaluation

Predictive performance was evaluated on the ability to annotate both individual labels (GO term-centric) and whole proteins (protein-centric), following the methodology adopted in [31]. For the GO term-centric evaluation, we calculate the F1 score, both for assessing the GO term-specific classifier training quality over 10-fold cross validation on the large training protein set and for assessing the predictive performance on the held-out protein set. In detail, the GO term-centric F1 score (F1_GO) is used for evaluating the performance of methods when predicting protein annotations for individual GO terms.

For the protein-centric evaluation, we calculate the F_max score by predicting the GO term annotations for the held-out protein set using the trained GO term-specific classifiers. The F_max score is used by the CAFA experiments [31]. The threshold τ corresponding to the F_max score is then used as prior knowledge to calculate the other type of protein-centric averaged F1 score, F_τ, for the temporal annotation validation.
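The F_max computation can be sketched as follows, with hypothetical protein and term identifiers. Following the CAFA convention, precision at each threshold is averaged only over proteins with at least one prediction, while recall is averaged over all proteins:

```python
def f_max(scores, labels, thresholds):
    """scores[p][t]: predicted confidence for protein p and term t;
    labels[p]: set of true terms for protein p."""
    best = 0.0
    for tau in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in labels.items():
            predicted = {t for t, s in scores[protein].items() if s >= tau}
            if predicted:
                precisions.append(len(predicted & true_terms) / len(predicted))
            recalls.append(len(predicted & true_terms) / len(true_terms))
        if precisions:
            pr = sum(precisions) / len(precisions)
            rc = sum(recalls) / len(recalls)
            if pr + rc > 0:
                best = max(best, 2 * pr * rc / (pr + rc))
    return best

# Toy example: one protein, one true term, two scored terms.
scores = {"P1": {"GO:A": 0.9, "GO:B": 0.3}}
labels = {"P1": {"GO:A"}}
print(f_max(scores, labels, [0.1, 0.5]))  # tau=0.5 gives a perfect F1 of 1.0
```

F_τ then reuses the single threshold that achieved F_max, instead of scanning all thresholds again on the temporal validation set.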
STRING2GO - a novel protein function prediction method

Here, we denote this type of STRING2GO method as STRING2GO_Embedding+SVM for clarity. In addition, owing to the natural functionality of DMNNs, we also propose another type of STRING2GO method, denoted STRING2GO_Embedding+Sigmoid, which directly adopts the sigmoid function in the last layer of the DMNNs to make predictions.

In this work, we evaluate the predictive performance of these two types of STRING2GO methods. We adopt two types of network embedding representation generation methods, i.e. Mashup [22] and Node2vec [32], to derive representations from the STRING networks. Mashup firstly evaluates the diffusion states of nodes in the network by running random walks with restart, and then learns low-dimensional representations that best approximate those diffusion states. Analogously, Node2vec firstly obtains the node neighbourhood information by truncated random walks. Then a Skip-gram [33,34] shallow neural network is used to generate a representation space that maximises the likelihood of preserving the corresponding node neighbourhood information. In this work, the neighbourhood information was sampled through random walks of length ten, which were biased towards close neighbours by setting the parameter q to 2. The DMNNs were trained to output all associated terms in the biological process domain. Each hidden layer had batch-normalised inputs [35], which were combined through maxout units [36] and were subject to dropout [37] in the course of training. A sigmoid function was used to activate the output neurons.
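A maxout hidden layer can be sketched as a forward pass: each unit outputs the maximum over k affine transformations of its input. This is a minimal numpy illustration with assumed shapes, not the trained model (batch normalisation and dropout are omitted):

```python
import numpy as np

def maxout_forward(x, W, b):
    """One maxout layer: W has shape (k, d_in, d_out), b has shape (k, d_out);
    each of the d_out units takes the max over k affine pieces."""
    z = np.einsum('i,kio->ko', x, W) + b   # k candidate activations per unit
    return z.max(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                     # e.g. an embedding vector
W = rng.normal(size=(2, 4, 3))             # k=2 pieces, 4 inputs, 3 units
b = rng.normal(size=(2, 3))
hidden = maxout_forward(x, W, b)           # shape (3,)
output = 1.0 / (1.0 + np.exp(-hidden))     # sigmoid output activation
```

Stacking several such layers and thresholding the sigmoid outputs yields multi-label GO term predictions, as in the Sigmoid variant; alternatively, the last hidden activations can be fed to per-term SVMs, as in the SVM variant.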

To limit the computational requirements for model optimisation, initial 10-fold cross validation experiments (with a random split of instances) were run to pre-select promising configurations. Subsequent training stages were aimed at selecting the optimal dimensions of the hidden layers, i.e. those leading to the highest median F1_GO scores (here rounded to two decimal places), from a limited set of options (300, 500, 700 and 1,000). In addition, we also evaluate the predictive performance when using the same dimensions for both the input features and the 3rd hidden layer outputs. Note that, due to the well-known curse of dimensionality issue [38], if two or more different dimensions of the 3rd hidden layer outputs obtain the same median F1_GO score, we only choose the lowest one as the optimal dimension.

The network-derived representations obtain the second highest F1_GO score (0.22) using the Mashup method, while also obtaining the same highest F1_GO score (0.17) using networks that include relatively rich PPI information and high coverage. We then report the optimal dimensions of the network embedding representations derived by the Mashup and Node2vec methods from those 5 STRING networks. According to the suggestion in [22], we define 800 as the optimal dimensionality for the input network embedding representations derived by Mashup. In terms of the Node2vec-derived network embedding representations, as shown in the 5th column of Table 1, 128 is the overall optimal dimensionality, since 4 out of 5 network-derived embedding representations with 128 dimensions obtain the highest F1_GO scores for predicting the 204 biological process terms. We then report the optimal dimensions of the STRING2GO-learnt functional representations. As shown in Table 1, the optimal dimensions of the 3rd hidden layer outputs vary between 500 and 1,000. Recall that we also evaluate Node2vec+SVM when using all five different STRING networks to generate embedding representations.
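The dimension-selection rule (highest rounded median F1 over folds, ties broken towards the smallest dimension) can be sketched with illustrative scores, not the paper's actual cross-validation results:

```python
import statistics

def select_dimension(cv_scores):
    """cv_scores: hidden-layer dimension -> per-fold F1 scores.
    Pick the dimension whose median F1 (rounded to 2 d.p.) is highest;
    on ties, keep the smallest dimension."""
    medians = {d: round(statistics.median(s), 2) for d, s in cv_scores.items()}
    best = max(medians.values())
    return min(d for d, m in medians.items() if m == best)

# Illustrative fold scores for three candidate dimensions.
cv_scores = {300: [0.21, 0.23, 0.22],
             500: [0.22, 0.22, 0.22],
             1000: [0.20, 0.21, 0.20]}
print(select_dimension(cv_scores))  # 300 and 500 tie at 0.22 -> pick 300
```

Preferring the smallest tied dimension keeps the learnt representations compact, in line with the curse-of-dimensionality concern raised above.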

The held-out evaluation results further confirm that the STRING2GO-learnt functional representations carry more predictive information. As shown in Table 2, STRING2GO_Mashup+Sigmoid and STRING2GO_Node2vec+Sigmoid also respectively obtain higher F1_GO scores, and the learnt functional representations obtain significantly higher GO term-centric F1_GO scores than the raw network embedding representations.

From the perspective of the protein-centric evaluation (i.e. considering the F_max and F_τ metrics), the STRING2GO-learnt functional representations also obtain higher predictive accuracy based on the Combinedscore network. The precision-recall curves show that these representations obtain higher precision and recall values simultaneously, since the middle parts of the red and blue curves lie higher than the orange one, while the middle parts of the grey and black curves also lie higher than the green one.

We also compare the predictive performance of the Mashup and Node2vec-derived network embedding representations and the corresponding STRING2GO-learnt functional representations respectively. Generally, the raw network embedding representations derived by the Mashup and Node2vec methods obtain competitive predictive accuracy when using SVM as the classification algorithm. To begin with, during the training stage, the F1_GO score obtained by Mashup+SVM is higher than the one obtained by Node2vec+SVM based on the Combinedscore network, since the orange bar is higher than the green one in Fig 2.a. However, both Mashup+SVM and Node2vec+SVM obtain poor predictive performance on the held-out evaluation, due to their zero F1_GO scores, although the statistical significance test results (see Table S2) show that the former still outperforms the latter. Those patterns are consistent when using all other 4 types of STRING networks to generate the raw embedding representations, as reported in Fig 2.b-2.e, Tables 2 and S1. In terms of the protein-centric evaluation, Mashup+SVM obtains a higher F_max score (0.470) than Node2vec+SVM (0.444) based on the Combinedscore network. In addition, all of the methods discussed above obtain higher F_max scores than those based on the Coexpression network.
All those methods also obtain higher F_τ scores than the Naïve prediction method based on the Combinedscore and Textmining networks. The increased distances between the two classes of proteins in the learnt representation space indicate that larger distances lead to higher predictive accuracy. We also display an example of the increased distance between the two classes of proteins when predicting the term GO:0090150, which shows the highest improvement.