Machine Learning of Protein Interactions in Fungal Secretory Pathways

In this paper we apply machine learning methods to predicting protein interactions in fungal secretion pathways. We assume an inter-species transfer setting, where training data is obtained from a single species and the objective is to predict protein interactions in other, related species. Our methodology combines several state-of-the-art machine learning approaches, namely multiple kernel learning (MKL), pairwise kernels and kernelized structured output prediction in the supervised graph inference framework. For MKL, we apply the recently proposed centered kernel alignment and p-norm path following approaches to integrate several feature sets describing the proteins, demonstrating improved performance. For graph inference, we apply input-output kernel regression (IOKR) in supervised and semi-supervised modes as well as output kernel trees (OK3). In our experiments simulating increasing genetic distance, IOKR proved to be the most robust prediction approach. We also show that the MKL approaches improve the predictions compared to a uniform combination of the kernels. We evaluate the methods on the task of predicting protein-protein interactions in the secretion pathways of fungi, with the baker's yeast S. cerevisiae being the source and T. reesei the target of the inter-species transfer learning. We identify completely novel candidate secretion proteins conserved in filamentous fungi. These proteins could contribute to their unique secretion capabilities.


Introduction
Protein secretion is a fundamental cellular process that is required for transporting proteins into cellular compartments, to the cell surface and to the extracellular space, as well as for covalent modifications such as disulphide bond formation and glycosylation of proteins. As can be expected from its central role, the protein secretion machinery is conserved in eukaryotes. Fundamental research to unravel its functioning has been carried out in the fungus Saccharomyces cerevisiae [1]. However, the baker's yeast S. cerevisiae of the subphylum Saccharomycotina does not naturally secrete large amounts of proteins, unlike the filamentous fungi of the subphylum Pezizomycotina. For example, the Pezizomycotina Trichoderma reesei (Hypocrea jecorina) is able to secrete its native cellulase proteins with yields of over 100 g/l in industrial conditions. Computationally, protein-protein interaction (PPI) prediction can be cast as a binary classification problem: to predict whether a pair of proteins interacts or not. Thus, any general model for classification learning is applicable in this setting, including ensemble learners [15][16][17], Naive Bayes, and support vector machines (SVM). SVM models rely on so-called pairwise kernels, where the similarities of protein pairs are compared to each other. Another class of PPI learning methods aims to predict interaction patterns by learning similarities between proteins in the protein interaction network. Output kernel trees [18] and input-output kernel regression [19] are recent examples of this kind of method.
The above approaches have not been explicitly applied to cross-species transfer learning, perhaps due to the limited amount of verified PPIs in a majority of species. Beyond basic sequence comparisons, more advanced computational methods have been applied in the cross-species setting only sparingly. In [20], a cross-species cluster co-conservation method is proposed that exploits phylogenetic profiles for predicting protein interaction networks. In [21], a link propagation approach relying on gene expression and sequence similarity was proposed and applied to cross-species metabolic network reconstruction.
There is a dire need for novel function and interaction prediction methods that are locally available, able to cross large sequence similarity distances, and do not require solving orthology-paralogy relationships, in order to cope with the rising number of genomes. In this paper, we introduce a framework of machine learning methods that can be used for predicting physical or functional protein-protein interactions, or more specific biological networks such as metabolic pathways, depending on what type of training labels is used. Our method uses as features various sequence similarity and protein family analyses derived from the CoReCo pipeline [22]. Although our method relies partly on sequence similarity, it is, through a combination of methods, still able to predict for proteins that do not belong to any known protein family. Hence our method can give clues for PPIs of previously unknown proteins. Our method introduces recently proposed multiple kernel learning (MKL) methods [23] to supervised network inference, thus boosting the performance of the latter method family and making full use of the wide array of sequence-derived features.
We focus on predicting the secretion machinery in industrially relevant fungi, in particular T. reesei. Our focus is on predicting functional protein-protein interactions (PPI) in the secretory pathway. As there is no verified protein interaction data available for these organisms, we assume the cross-species transfer learning setting, where the training data comes from S. cerevisiae and the prediction target is T. reesei.

Data and preprocessing
Sequence data. In this paper the models are based on features that can be computationally derived from protein sequence data. The sequence data for the two studied organisms were downloaded from the SGD database (http://www.yeastgenome.org) for S. cerevisiae and from the JGI Mycocosm database (http://genome.jgi.doe.gov/Trire2/Trire2.home.html) for T. reesei.
Protein-protein interaction data. The machine learning methods require a set of known PPIs to be used as ground truth for training and testing the models. We obtained our PPI data from the recently published genome-scale model of the yeast secretory machinery [24], which gathers the knowledge of 50 years of research on secretion in S. cerevisiae. The authors identified 162 proteins involved in secretion, assigned to 16 subsystems such as translocation, ER glycosylation, COP and Golgi processing. These protein complexes yield 2200 undirected interactions between the 162 secretion proteins, which are used as training labels.
Feature extraction. For our models we use several types of features to characterize the similarity of proteins as well as the similarity of protein pairs. For all protein sequences of the two organisms we computed the following features using the CoReCo pipeline [22]: sequence alignments with BLAST against the UniProt database as well as the Global Trace Graph (GTG) [25], and protein domains and functional sites gathered by InterProScan [26] from its member databases: Pfam [27], Panther [28], Gene3D [29], PRINTS [30], Prosite [31], PIRSF [32], SMART [33], and SUPERFAMILY [34] (see S1 Table for details on these data sources).
Artificial sequences. We used artificial data to test whether the different biological network inference algorithms that have been developed for intra-species prediction also work for inter-species prediction with low sequence similarity, well below commonly used amino acid sequence identity cut-off values. For obtaining artificial sequences with varying levels of sequence similarity, we altered the sequences of the 162 secretion proteins of S. cerevisiae based on Blosum matrices [35]. These matrices represent the substitution probabilities from one amino acid to another in natural sequence data sets. Hence, they allow approximation of natural sequence evolution. We created four different data sets where we deleted and mutated 70%, 60%, 38% and 20% of the amino acids according to Blosum30, Blosum40, Blosum62 and Blosum80, respectively. The Blosum matrices were downloaded from the NCBI Blast site (ftp://ftp.ncbi.nih.gov/blast/matrices/). Each Blosum matrix has been made by combining proteins that are no more similar than a given percentage (30%, 40%, 62% and 80%) into one single sequence and then comparing only those sequences [35]. In Fig 1, the percentage of amino acid sequence identity between the artificially mutated protein sequences of S. cerevisiae and T. reesei, based on Smith-Waterman alignment, is shown. Based on visual comparison, the generated Blosum30 data set has a level of sequence similarity to S. cerevisiae similar to that of T. reesei. In the experiments, the artificially perturbed sequences were coupled with the labels of the corresponding original sequences.
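The mutation step can be sketched as follows. This is a minimal illustration only: the SUBSTITUTIONS table below is a hypothetical stand-in for substitution preferences read from the downloaded Blosum matrix files, and the deletion step is omitted for brevity.

```python
import random

# Hypothetical toy substitution table; in the actual procedure the row-wise
# substitution preferences would be derived from the downloaded Blosum files.
SUBSTITUTIONS = {
    "A": "GSTV",
    "L": "IMVF",
    "K": "RQEN",
}

def mutate_sequence(seq, fraction, rng=None):
    """Replace `fraction` of the positions with a Blosum-style substitute;
    residues missing from the table are left unchanged."""
    rng = rng or random.Random(0)
    seq = list(seq)
    for i in rng.sample(range(len(seq)), int(len(seq) * fraction)):
        choices = SUBSTITUTIONS.get(seq[i])
        if choices:
            seq[i] = rng.choice(choices)
    return "".join(seq)

mutated = mutate_sequence("ALKALKALKA", 0.6)   # ~Blosum40-level perturbation
```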
Transcriptomic data analysis for biological network validation. The transcriptomic data for the validation of the T. reesei PPI network was composed of eight publicly available data sets taken from the Gene Expression Omnibus [36] plus eight in-house data sets. The public data sets contained 76 samples altogether and the in-house data sets 499 samples. Once combined, the final data set contained 575 samples and 9078 genes. Each data set was normalized separately using quantile normalization [37] and normalized again, after the data sets were combined, using ComBat normalization [38].

Problem formalization
Supervised graph inference was introduced a decade ago in [14] and has been widely used for biological network reconstruction since. Given a set of nodes V = {v_1, ..., v_m}, a biological network can be defined as an undirected graph G = (V, E), where E ⊆ V × V are the edges between the m vertices. The graph can be represented by a symmetric adjacency matrix Y = (y_ij) of size m × m, where y_ij = y_ji = 1 if the nodes v_i and v_j are connected and y_ij = y_ji = 0 otherwise. We will also use the shorthand y(v_i) = (y_ij)_{j=1..m} to denote the connectivity pattern of protein v_i in the network. In addition, we assume that each node v_i has assigned features x(v_i) ∈ χ, for some input space χ.
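The adjacency representation above is straightforward to construct; a minimal sketch, with purely illustrative protein names:

```python
import numpy as np

def adjacency_from_edges(nodes, edges):
    """Symmetric adjacency matrix Y of an undirected PPI graph; row i of Y
    is the connectivity pattern y(v_i), used later as a multilabel."""
    index = {v: i for i, v in enumerate(nodes)}
    Y = np.zeros((len(nodes), len(nodes)), dtype=int)
    for u, v in edges:
        Y[index[u], index[v]] = Y[index[v], index[u]] = 1
    return Y

nodes = ["SEC61", "SEC62", "SEC63", "KAR2"]    # illustrative protein names
edges = [("SEC61", "SEC62"), ("SEC62", "SEC63"), ("SEC63", "KAR2")]
Y = adjacency_from_edges(nodes, edges)
```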
The learning task is then defined as follows: given partial knowledge of the graph G = (V, E) and the feature representation of the nodes, determine a function f: V × V → {0, 1} that best approximates the unknown edges of the graph.
Note that the main difficulty in solving this problem is that the features are assigned to individual nodes while the labels are assigned to pairs of nodes [9]. To transform the task into a standard classification problem, we use a global approach that tries to find a feature representation for pairs of nodes. Another issue inherent to biological network inference is the substantial class imbalance, since the number of positive interactions is small compared to the number of all possible interactions. Thus special care is needed when setting up the evaluation experiments, see e.g. [39]. First of all, the evaluation metrics should be chosen such that the class imbalance does not lead to incorrect conclusions (e.g. the AUPR metric explained below). Secondly, methods that predict an interaction profile for each protein (see OK3 and IOKR below), represented as a multilabel, i.e. a binary vector containing interaction labels for all other proteins, are able to mitigate the class imbalance, since in general the set of multilabels is diverse with no very frequent multilabel. In [9] it is recommended to perform cross-validation on the nodes, as cross-validation on pairs tends to give too optimistic results. A schematic representation of the duality between the biological network and the adjacency matrix, and of cross-validation on nodes, is given in Fig 2. Finally, for performing inter-species biological network inference we use the protein sequences and their interactions from one species as the training set and the protein sequences from the second species as the test set. Note that in this setting the training-test interactions are not of interest and that the feature representation needs to be the same for training and test proteins.
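Cross-validation on nodes can be sketched as follows: node folds are drawn first, and a protein pair enters the test set only when both of its endpoints are held-out nodes. This is a minimal illustration of the scheme recommended in [9].

```python
import numpy as np

def node_fold_pairs(n_nodes, n_folds=5, seed=0):
    """Yield (train_pairs, test_pairs) per fold: nodes are split into folds,
    and a pair is a test pair only if BOTH endpoints are held-out nodes."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_nodes), n_folds)
    all_pairs = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
    for held_out in folds:
        test = set(held_out.tolist())
        train_pairs = [p for p in all_pairs if p[0] not in test and p[1] not in test]
        test_pairs = [p for p in all_pairs if p[0] in test and p[1] in test]
        yield train_pairs, test_pairs

train_pairs, test_pairs = next(node_fold_pairs(10, n_folds=5))
```

With 10 nodes and 5 folds, each held-out fold has 2 nodes, giving 1 test pair and C(8, 2) = 28 training pairs; pairs with one train and one test endpoint are discarded, which is exactly what makes node-wise cross-validation stricter than pair-wise.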

Inference Algorithms
In this section we present three different approaches for supervised network inference that we have applied to inter-species PPI network prediction. Additionally, we present different approaches for learning kernels that account for the relevance of a data source for the learning task.
Output kernel trees (OK3). have been proposed by [18] and are based on a kernel embedding of the graph, where the output kernel k_Y is defined such that adjacent vertices have higher kernel values than non-adjacent ones. To achieve this, the diffusion kernel is commonly used, K_Y = exp(−βL), where L is the Laplacian matrix of the graph, L = D − Y, with D being the degree matrix and Y the adjacency matrix. Additionally, β > 0 is a user-defined parameter that controls the degree of diffusion.
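Since L is symmetric, the matrix exponential can be computed via an eigendecomposition; a minimal sketch on a toy graph:

```python
import numpy as np

def diffusion_kernel(Y, beta=1.0):
    """K_Y = exp(-beta * L) with L = D - Y the graph Laplacian; since L is
    symmetric, the matrix exponential is computed by eigendecomposition."""
    L = np.diag(Y.sum(axis=1)) - Y
    w, V = np.linalg.eigh(L)
    return V @ np.diag(np.exp(-beta * w)) @ V.T

# Tiny path graph 0 - 1 - 2: adjacent nodes get larger kernel values
Y = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
K = diffusion_kernel(Y, beta=0.5)
```

On this path graph the kernel value between the adjacent nodes 0 and 1 exceeds that between the non-adjacent nodes 0 and 2, which is the property the output kernel is designed to have.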
The OK3 algorithm relies on the top-down induction algorithm widely used for learning decision trees (e.g. CART [40]). The method starts with a tree represented by a single leaf and then recursively partitions (or splits) the input data S until the data is homogeneous enough (in our case: until the proteins in S have similar connectivity patterns). The data arriving at leaf L of the decision tree is split into two parts, S_l and S_r, using a binary test T(x) ∈ {0, 1} based on the value of a single input feature of x (e.g. does the protein have a given motif or not). The two sets S_l = {x ∈ S | T(x) = 0} and S_r = {x ∈ S | T(x) = 1} will be recursively used to grow subtrees which are then attached as the children of L.
For learning the decision trees on the input vectors x(v_i), i = 1, ..., m, the following score is maximized to select a test T to be inserted in the decision tree leaf, given the set of inputs S routed to the current leaf:

Score(T, S) = Var{ψ(v) | S} − (N_l/N) Var{ψ(v) | S_l} − (N_r/N) Var{ψ(v) | S_r},

where ψ(v) is the output feature vector and N, N_l and N_r are the sizes of the training sample S and its left and right splits, S_l and S_r, respectively. The variance of the output feature vectors in the set S can be easily computed using the kernel trick:

Var{ψ(v) | S} = (1/N) Σ_{i=1}^{N} k_Y(v_i, v_i) − (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} k_Y(v_i, v_j).

One main advantage of the OK3 approach is that the decision tree on the input features results in a ranking of relevant features for the learning task.
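The split score can be evaluated from the output kernel alone, without ever forming ψ explicitly; a minimal sketch of the kernel-trick computation:

```python
import numpy as np

def output_variance(K_Y, idx):
    """Variance of the implicit output feature vectors psi(v) over the sample
    `idx`, computed from the output kernel alone (the kernel trick)."""
    K = K_Y[np.ix_(idx, idx)]
    n = len(idx)
    return np.trace(K) / n - K.sum() / n ** 2

def split_score(K_Y, idx, left, right):
    """Variance reduction achieved by splitting `idx` into `left`/`right`."""
    n = len(idx)
    return (output_variance(K_Y, idx)
            - len(left) / n * output_variance(K_Y, left)
            - len(right) / n * output_variance(K_Y, right))

# With an identity output kernel (all outputs distinct), any split reduces
# the variance; a fully homogeneous sample has zero variance.
score = split_score(np.eye(3), [0, 1, 2], [0], [1, 2])
```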
For prediction, each leaf L is labeled with the mean output ψ̂_L = (1/N_L) Σ_{i=1}^{N_L} ψ(v_i), analogously to standard regression trees, where N_L is the number of training samples that reach the leaf. Finally, the kernel value between two vertices v and v′, where x(v) reaches leaf L_1 and x(v′) reaches leaf L_2, can be approximated as

k̂_Y(v, v′) = (1/(N_{L_1} N_{L_2})) Σ_{i=1}^{N_{L_1}} Σ_{j=1}^{N_{L_2}} k_Y(v_i^1, v_j^2),

where v_i^k, i = 1, ..., N_{L_k}, enumerate the vertices routed to leaf L_k; interactions are then predicted by thresholding these kernel values. For improving the accuracy of the method, an ensemble of decision trees, also known as a random forest, is used. In our experiments we used the C code provided by the authors [18].
Kernels on protein pairs. The main idea of the biological network reconstruction methods presented in [8] is to reformulate the task as a pattern recognition problem: given a training set τ = {(u_1, t_1), (u_2, t_2), ..., (u_N, t_N)} of patterns u_i ∈ R^q with binary labels t_i ∈ {−1, 1}, infer a function f: R^q → {−1, 1} for any new pattern u. The main hindrance in doing so is that in network reconstruction the labels are defined on pairs of vertices while the input features or patterns are defined on individual vertices. Thus, in a first step, a so-called linear kernel on vertices, induced by their input features, is defined as the inner product k_X(v, v′) = x(v)^T x(v′). These kernels k_X represent the similarity of any pair of protein sequences and are then used to compute kernels on pairs of protein pairs, for example

k_TPPK((v_1, v_2), (v_3, v_4)) = k_X(v_1, v_3) k_X(v_2, v_4) + k_X(v_1, v_4) k_X(v_2, v_3),

k_MLPK((v_1, v_2), (v_3, v_4)) = (k_X(v_1, v_3) − k_X(v_1, v_4) − k_X(v_2, v_3) + k_X(v_2, v_4))².

Now a standard support vector machine (SVM) can be used to solve the binary classification task. Since PPI networks are undirected, the tensor product pairwise kernel k_TPPK and the metric learning pairwise kernel k_MLPK are best suited for modelling the similarity between protein pairs.
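Given a precomputed base kernel matrix K over proteins, both pairwise kernels are one-liners; a minimal sketch, where the index arguments pick out the two protein pairs being compared:

```python
import numpy as np

def tppk(K, i, j, k, l):
    """Tensor product pairwise kernel, symmetrized for undirected pairs."""
    return K[i, k] * K[j, l] + K[i, l] * K[j, k]

def mlpk(K, i, j, k, l):
    """Metric learning pairwise kernel."""
    return (K[i, k] - K[i, l] - K[j, k] + K[j, l]) ** 2

# Toy base kernel over two proteins; compare the pair (0, 1) with itself.
K = np.array([[1.0, 0.5], [0.5, 1.0]])
t = tppk(K, 0, 1, 0, 1)
m = mlpk(K, 0, 1, 0, 1)
```

Note that swapping the order within a pair, e.g. tppk(K, 0, 1, 1, 0), gives the same value, which is what makes these kernels suitable for undirected networks.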
Despite the method's good predictive performance, it has a major drawback: the kernels between pairs of proteins can quickly become very large even for a moderate number of protein sequences. The space complexity for storing the kernel matrix is O(m^4), where m is the number of proteins in the biological network, which leads to serious scalability problems and heavy usage of computational resources [21].
Input-Output Kernel Regression (IOKR). This method combines elements of the two previous algorithms in a way that circumvents their respective disadvantages: on the input side it uses the simple kernels on protein pairs and on the output side it uses the diffusion kernel built from the adjacency matrix of the output graph. The classification problem is addressed by solving a kernel learning problem using regularized regression [19,41]. The method comes in two flavors: the supervised version learns only the kernel ridge regression model, while the semi-supervised one adds a smoothness constraint using the inputs of labeled data and auxiliary data, called unlabeled data.
As in the OK3 method, IOKR solves the link prediction problem by learning an output kernel k_Y: V × V → R that encodes the similarities between the proteins in the interaction network. After learning this kernel, positive interactions are predicted for the kernel values that exceed some threshold θ:

f_θ(v, v′) = 1 if k̂_Y(v, v′) > θ, and 0 otherwise.

As k_Y is a kernel, its values can be written as an inner product of output feature maps, k_Y(v, v′) = ⟨ψ(v), ψ(v′)⟩. The IOKR method approximates the output feature map ψ with a function h and then builds an approximation of the output kernel k_Y by taking the inner product between the values of this function: k̂_Y(v, v′) = ⟨h(v), h(v′)⟩. Thus learning f_θ reduces to learning the single-variable function h. Given models of the general form h_M(v) = Mϕ(v) and assuming a regularized square loss function, the parameters of the supervised IOKR model can be estimated based on ℓ training samples as follows:

M̂ = argmin_M Σ_{i=1}^{ℓ} ‖ψ(v_i) − Mϕ(v_i)‖² + λ_1 ‖M‖²_F,

where λ_1 > 0 is a regularization parameter that is tuned with cross-validation in the experiments.
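The exact closed forms are given in [19]; the following is a minimal sketch of the supervised variant under the standard kernel ridge regression solution, with hypothetical toy kernels. The predicted output kernel among test proteins is obtained by sandwiching a "core" matrix between input kernel evaluations.

```python
import numpy as np

def iokr_fit(K_X, K_Y, lam):
    """Supervised IOKR sketch: with h_M(v) = M phi(v) and a regularized
    square loss, ridge regression gives a core matrix C such that the
    predicted output kernel is k_hat(v, v') = k_x(v)^T C k_x(v')."""
    n = K_X.shape[0]
    B = np.linalg.inv(K_X + lam * np.eye(n))
    return B @ K_Y @ B

def iokr_predict(K_test_train, C):
    """Predicted output kernel among test proteins; thresholding its entries
    at theta yields the predicted interactions."""
    return K_test_train @ C @ K_test_train.T

# Sanity check: as lam -> 0 and with test == train, the model reproduces K_Y.
K_X = np.array([[2.0, 1.0], [1.0, 2.0]])   # toy input kernel
K_Y = np.array([[1.0, 0.2], [0.2, 1.0]])   # toy output (diffusion) kernel
K_hat = iokr_predict(K_X, iokr_fit(K_X, K_Y, lam=1e-9))
```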
The method has also been extended to the semi-supervised setting, where the input of unlabeled data is taken into account. The new cost function to be minimized augments the supervised objective with a smoothness term over the ℓ labeled and n unlabeled inputs:

Σ_{i=1}^{ℓ} ‖ψ(v_i) − Mϕ(v_i)‖² + λ_1 ‖M‖²_F + λ_2 Σ_{i,j=1}^{ℓ+n} (L_X^n)_{ij} ‖Mϕ(v_i) − Mϕ(v_j)‖²,

where L_X^n = exp(−β(D_n − K_X^n)) denotes the diffusion kernel associated with the input kernel matrix on labeled and unlabeled data. The last term constrains proteins that are similar to each other in the input space to be similar in the predicted interaction network. λ_1 > 0 and λ_2 > 0 are two regularization parameters that are tuned with cross-validation in the experiments. Both minimization problems lead to closed-form solutions that can be found in Propositions 4 and 6 of [19].
Multiple Kernel Learning (MKL). The heterogeneous set of features that we extracted from the protein sequences is not expected to contribute information uniformly to the learned model, which makes the uniform combination of the kernels over the different data sources suboptimal. Therefore we apply Multiple Kernel Learning (MKL) to take the relevance of the features into account. We focus on linear mixtures of kernels,

K_μ = Σ_q μ_q K_q,

where the weights μ_q are typically restricted to be non-negative to ensure the PSD property of the resulting mixture. Note that setting μ_q = 1 for all kernels yields the uniform kernel combination. A major step forward in the MKL field was learning kernels based on the centered kernel-target alignment [23]

ρ(K, K_Y) = ⟨K_c, K_Yc⟩_F / (‖K_c‖_F ‖K_Yc‖_F),

where ⟨.,.⟩_F is the Frobenius product, ‖.‖_F the Frobenius norm, K_Y is a target kernel and K_c denotes the centered version of the input kernel K, obtained by the centering operation

K_c = (I − 11^T/m) K (I − 11^T/m),

where 1 denotes the vector of ones and I is the identity matrix. This gives a simple improvement over the uniform combination of kernels by directly using the kernel-target alignment scores ρ(K_q, K_Y) as the mixture weights:

μ_q = ρ(K_q, K_Y).

This MKL method is called ALIGN. In [23] it is claimed that the kernel centering is critical for the kernel alignment score to correlate well with performance.
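The ALIGN weighting scheme can be sketched directly from these definitions; the toy kernels below are purely illustrative:

```python
import numpy as np

def center_kernel(K):
    """K_c = (I - 11^T/m) K (I - 11^T/m)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def centered_alignment(K, K_Y):
    """Centered kernel-target alignment rho(K, K_Y)."""
    Kc, Kyc = center_kernel(K), center_kernel(K_Y)
    return (Kc * Kyc).sum() / (np.linalg.norm(Kc) * np.linalg.norm(Kyc))

def align_weights(kernels, K_Y):
    """ALIGN: each kernel's alignment score is used directly as its weight."""
    return np.array([centered_alignment(K, K_Y) for K in kernels])

# Toy example: K1 mirrors the target's block structure, the identity does not,
# so K1 receives the larger mixture weight.
K_Y = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
K1 = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]])
w = align_weights([K1, np.eye(3)], K_Y)
```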
The independent kernel alignment presented above neglects the correlations between the base kernels. This can be overcome by jointly maximizing the alignment between the convex combination of the kernels and the target kernel, a method referred to as ALIGNF:

max_μ ρ(Σ_q μ_q K_qc, K_Y) subject to ‖μ‖_2 = 1, μ ≥ 0.

The alignment maximization problem can be rewritten as

v* = argmin_{v ≥ 0} v^T M v − 2 v^T a, with μ = v*/‖v*‖_2,

where a = (⟨K_1c, K_Y⟩_F, ..., ⟨K_rc, K_Y⟩_F)^T records the kernel-target alignments of the input kernels and M = (M_ql) with M_ql = ⟨K_qc, K_lc⟩_F contains the pairwise alignments between the input kernels. The problem can be solved by quadratic programming [23]. Another approach for optimizing the kernel-target alignment has been proposed in [42]. The method aims at sparse combinations of kernels by regularizing the kernel weights with an ℓ_p-norm, where p ≥ 1 is simultaneously optimized. The proposed generalized ℓ_p-norm kernel-target alignment formulation is as follows:

min_{μ ≥ 0} ‖Σ_q μ_q K_qc − K_Y‖²_F + λ_1 B_F(μ) + λ_2 ‖μ‖_p^p.

The squared Euclidean distance in the first term is an instantiation of the Bregman divergence [42] B_F(μ) = F(μ) − (μ − μ_0)^T ∇F(μ_0) for F(μ) = ⟨μ, μ⟩, where μ_0 is a fixed point in the domain of F (following [42] we used μ_0 = 0 in our experiments). Additionally, λ_1 ≥ 0 and λ_2 ≥ 0 are the regularization parameters. To implement the sparsity-inducing ℓ_p regularizer, p is systematically reduced towards unity until a sufficient level of sparsity is obtained. The solution of the path following is computed with a predictor-corrector algorithm [42].
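The ALIGNF quadratic program is small (one variable per kernel), so even a simple projected gradient loop solves it; a dedicated QP solver would normally be used. A minimal sketch on a hypothetical two-kernel problem:

```python
import numpy as np

def alignf_weights(a, M, steps=5000, lr=0.01):
    """ALIGNF sketch: solve min_{v >= 0} v^T M v - 2 v^T a by projected
    gradient descent and normalize the solution to unit norm."""
    v = np.ones_like(a, dtype=float)
    for _ in range(steps):
        v = v - lr * (2.0 * M @ v - 2.0 * a)
        v = np.maximum(v, 0.0)      # project onto the non-negative orthant
    return v / np.linalg.norm(v)

# Toy problem: kernel 1 is aligned with the target, kernel 2 is not, and the
# two kernels are mutually uncorrelated (M = I), so nearly all weight goes
# to kernel 1.
a = np.array([1.0, 0.0])
M = np.eye(2)
mu = alignf_weights(a, M)
```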

Evaluation metrics for Binary Predictions
Binary classification problems are typically evaluated with the accuracy measure, computed as the number of correctly predicted pairs divided by the total number of pairs. For highly imbalanced problems like network inference, accuracy is not an appropriate measure because it favours the majority class and thus the non-interactions. In the following, Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves are presented, which are better suited for evaluating network inference predictions [43]. Both measures are based on a so-called confusion matrix, which is 2 × 2 for binary classification, with the columns and rows representing the predicted and the actual classes, respectively. Denoting interactions as positive and non-interactions as negative, the confusion matrix is given in Table 1.
From this matrix several measures for model evaluation can be derived:
• True positive rate (TPR): also known as sensitivity or recall, the number of true positives divided by the number of actual positives, TP/P.
• False positive rate (FPR): the number of false positives divided by the number of actual negatives, FP/N.
• Precision: the number of true positives divided by the number of predicted positives, TP/(TP + FP).
• Specificity: also known as the true negative rate, the number of true negatives divided by the number of actual negatives, TN/N.
These measures need to be combined in order to give a reliable performance measure of an algorithm, e.g. specificity and sensitivity, or precision and recall. Note as well that a threshold needs to be defined if the predictions are confidence scores. For evaluating algorithms with varying confidence thresholds, ROC and PR curves can be used.
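The derived rates follow directly from the four confusion-matrix counts; a minimal sketch with hypothetical counts:

```python
def confusion_rates(tp, fp, tn, fn):
    """Rates derived from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn            # actual positives / negatives
    return {
        "tpr": tp / p,                 # sensitivity, recall: TP / P
        "fpr": fp / n,                 # false positive rate: FP / N
        "precision": tp / (tp + fp),   # predicted positives that are true
        "specificity": tn / n,         # true negative rate: TN / N
    }

# Hypothetical counts for an imbalanced problem (10 positives, 90 negatives)
rates = confusion_rates(tp=8, fp=4, tn=86, fn=2)
```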
ROC curves. plot the TPR over the FPR for varying confidence thresholds. More specifically, each threshold corresponds to a different confusion matrix and thus a different pair of values for TPR and FPR and a point on the ROC curve. The end points are always (0, 0) and (1, 1) and a perfect classifier would pass through the point (0, 1), while a random classifier would be a diagonal connecting (0, 0) and (1, 1). A common summary statistic of the ROC curve is the area under the ROC curve (AUROC). AUROC is one for a perfect classifier and 0.5 for a random one. For the highly imbalanced network prediction tasks even moderate FPR can lead to more FP predictions than TP predictions and hence a very low precision.
PR curves. plot the precision over the recall for varying confidence thresholds. The curve starts at a pseudo-point (0, 1) and ends at (1, P/(P + N)), which corresponds to predicting all pairs as positive. An optimal classifier would also pass through (1, 1). The area under the PR curve (AUPR) is also a common summary statistic. As for AUROC, the higher the AUPR, the better the performance of the method. One advantage of PR curves over ROC curves is that they allow measuring early precision, where recall is low, and thus give a tool to evaluate the quality of the top ranks of the result list.
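Both curves and their areas follow directly from a ranked prediction list; a minimal sketch that sweeps the threshold over the sorted scores and integrates with the trapezoidal rule:

```python
import numpy as np

def roc_pr_points(scores, labels):
    """Sweep the confidence threshold down the ranked list and collect the
    (FPR, TPR) and (recall, precision) points of the ROC and PR curves."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    P, N = labels.sum(), (1 - labels).sum()
    tp, fp = np.cumsum(labels), np.cumsum(1 - labels)
    return fp / N, tp / P, tp / P, tp / (tp + fp)  # fpr, tpr, recall, precision

def auc(x, y):
    """Area under a curve by the trapezoidal rule, prepending a point at x=0."""
    x = np.concatenate(([0.0], x))
    y = np.concatenate(([y[0]], y))
    return float(((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0).sum())

# A perfectly ranked toy prediction list: both areas equal 1.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 1, 0, 0]
fpr, tpr, recall, precision = roc_pr_points(scores, labels)
```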

Results
We report here on three sets of experiments. First, we evaluate how the prediction methods perform under simulated sequence data, representing differing amount of genetic distance between the source and target species. Second, we check how well the methods separate the secretory pathway from the rest of the genome. Third, we evaluate the PPI prediction in the cross-species transfer learning from S. cerevisiae to T. reesei.

Network reconstruction for evolutionary distant sequences
Here we compare the performance of the network inference methods Output Kernel Trees (OK3), tensor kernel SVM on protein pairs (PP), and supervised and semi-supervised Input-Output Kernel Regression (IOKR) for evolutionarily distant species. As training data, we use the S. cerevisiae secretory pathway protein sequences as input and their functional interactions as labels. We then try to predict these interactions for secretory pathway protein sequences that were perturbed using different BLOSUM matrices corresponding to different genetic distances. Fig 3 depicts the Receiver Operating Characteristic (ROC) curves with associated area-under-curve (AUC) statistics for each inference method at the different evolutionary distances. As expected, all methods predict better the smaller the distance, with the BLOSUM80 curves having the highest AUC and being closest to the top-left corner of the plots. The curves are averages over a 20-fold cross-validation experiment.
In terms of AUC, OK3 obtains the best results, the tensor kernel (PP) the second best, with the IOKR methods being somewhat less accurate. However, closer examination of the methods' prediction performance for top-ranked interactions (FPR < 0.1) reveals that the IOKR methods in fact have the best early precision, and thus would predict the top-ranked interactions more accurately than the competing methods.
The AUC statistics and the ROC curves of OK3 follow a smoothly worsening pattern with respect to increasing evolutionary distance, while the other methods manifest a step change, with BLOSUM30 being markedly worse in AUC and lying clearly below the other curves. Fig 4 depicts the Precision-Recall (PR) curves of the same experiment. Here, the IOKR methods clearly perform best, having close to perfect precision regardless of the evolutionary distance up to a recall level of 0.5, followed by a sharp drop at recall levels of 0.7-0.9 depending on the evolutionary distance. In contrast, OK3 manifests close-to-one precision only for BLOSUM80 and for recall levels up to 0.3. Pairwise kernels do not obtain high precision and produce a pattern that is inverted with respect to the evolutionary distance, indicating a high number of false positives in the SVM classifier and possible overfitting when the evolutionary distance is small.
The ROC and PR curves together indicate IOKR as the best compromise, given that both a high overall accuracy and high initial precision are desirable for network reconstruction.
Identifying secretory pathway PPIs from the full genome

Next, we check how well transfer learning of the secretory pathway works in the basic case of the source and target species being the same. In this experiment, the inference models were trained on the S. cerevisiae secretion proteins and their functional interactions, and the goal is to test the ability of the models to correctly identify the secretion pathway proteins among all S. cerevisiae proteins. In this setup, the ground truth is composed of PPIs between two secretory pathway proteins as the positive class and all other interactions as the negative class (true interactions involving one or two non-secretory proteins, as well as missing interactions between pairs of secretory pathway proteins). Fig 5 depicts the results of a 5-fold cross-validation experiment for the different network inference methods. In the ROC space (left pane), pairwise kernels and the two IOKR methods are close in performance, with the semi-supervised IOKR being marginally better than the other two. OK3, however, performs significantly worse than the other three methods. In the precision-recall space (right pane), the two IOKR methods are the most robust in the low-recall regime, with the semi-supervised variant maintaining a 0.7 precision rate up to a 0.6 recall rate. Pairwise kernels and OK3 demonstrate a different pattern: they suffer from a high false positive rate in the low-recall regime, but have good precision in the mid-recall regime, before tailing off.
Analyzing the ROC and PR results together, semi-supervised IOKR emerges as the best compromise, due to its good ROC behaviour and good precision in the low-recall regime. It appears that the semi-supervised aspect gives some protection for the method against false positives in the low to mid-recall levels.

Comparison of Multiple Kernel Learning methods
Next, we compare the different MKL methods ALIGN, ALIGNF and p-norm path following on the reconstruction of the set of secretion proteins from the full genome of S. cerevisiae, using semi-supervised IOKR as the predictor.
The results are shown in Fig 6. It can be seen that the MKL methods perform better than the simple sum of input kernels (UNIMKL) in terms of both ROC and PR curves.
Nonetheless, the gains from MKL are smaller than we expected. Looking at the ROC curves, the p-norm path following MKL outperforms the other methods, whereas for the PR measure the simpler ALIGNF outperforms all other methods, with p-norm path following being the second best.

Secretion network prediction for Trichoderma reesei

Finally, we evaluate the PPI prediction quality in an inter-species setup, where the training data comes from S. cerevisiae and the target species is T. reesei. However, no experimental protein interaction data exists for T. reesei that could be used as the ground truth. Thus we focus on a qualitative analysis of the predicted T. reesei secretion network by expert knowledge. For predicting the PPIs in T. reesei we used semi-supervised IOKR and p-norm path following for learning the input kernel, since this method combination achieved the best performance in the previous experiments.
In order to validate the predicted T. reesei secretion network, its genes (T. reesei genome version 2.0 [44]) were annotated with a combination of sequence similarity based methods: best BLASTp [45] match to S. cerevisiae proteins, best BLASTp match to UniProtKB/Swiss-Prot [46], Interproscan domain predictions [26], PANNZER description line and GO-category predictions [47] and a manually curated set of Aspergillus niger protein secretion related genes [48].
The T. reesei secretion network contains in total 320 genes. According to the annotation described above, 27 genes belong to the heterokaryon incompatibility family and are sequence-wise very similar. This family contains a GTPase domain that could have similar features to the GTPases involved in secretion. 51 genes belong to categories of cellular function other than secretion. 18 genes were annotated as related to cell growth, cell wall synthesis and cell motility, and six were found to be related to chromatin modification. In general, these 24 proteins contain domains related to small-molecule modifications of macromolecules, such as glycosylation, phosphorylation, ubiquitinylation and methylation. Similar molecular functions are abundant in the known secretion pathway enzymes. 14 of the 51 were annotated as molecular and cellular function unknown (column 'Class' in Table 2). Hence, manual annotation based on sequence similarity suggests a minimum of 75% true positive rate and a maximum of 20% false positive rate. The predicted secretion network, excluding the heterokaryon incompatibility family and the non-secretion-related genes in order to ease visual inspection of known genes, is shown in Fig 7. An alternative layout (S1 Fig) and a table format (S2 Table) of the network are also provided as supplementary material. 16 of the 320 genes were found to have no interactions in the protein-protein interaction database STRING [4] (column 'In STRING' in Table 2).

Table 2. Unknown genes and genes without any interactions in STRING in the predicted T. reesei secretion network. Column 'Gene' contains the T. reesei gene ID. 'In STRING' tells whether the gene has interactions in STRING. Columns 'Btw' and 'Deg' denote the betweenness and degree network statistics of the corresponding gene. Columns 'Class' and 'Putative secretion pathway component' are author-assigned classifications. 'Taxon specificity' gives the largest taxonomic group the gene was found in.
For 2 of these we found strong similarity-based evidence that they are part of the Golgi mannosyltransferase complex; their absence from STRING could thus be considered a false negative.
For each gene annotated as unknown, a putative role in the secretion pathway machinery was assigned based on its position in the predicted network (column 'Putative secretion pathway component' in Table 2). To estimate the novelty of such secretion pathway components, the taxonomic distribution of the unknown genes was estimated with multi-genome protein clustering [49] (column 'Taxon Specificity' in Table 2). All unknown genes were found to be restricted to the subphylum Pezizomycotina or a smaller taxon within Pezizomycotina.
To further validate the T. reesei secretion network, we used a combined transcriptomics data set of public and in-house data (see Methods). The Pearson correlation of the expression values was computed for every gene pair with a predicted PPI (an edge in the PPI network). The average of the absolute values of these correlations was 0.2, with an empirical p-value of p < 0.05. This p-value was calculated by rewiring the network 1000 times with the igraph function 'rewire' [50] and recomputing the average absolute correlation each time. Absolute correlations above 0.3 are highlighted in
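The rewiring test above can be sketched as follows. This is a minimal reproduction of the procedure on synthetic data: the paper uses igraph's 'rewire', while here networkx's degree-preserving `double_edge_swap` serves as a stand-in, and both the expression matrix and the network are randomly generated placeholders.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

# Placeholder data: expression profiles (30 conditions per gene) and a
# random graph standing in for the predicted PPI network.
genes = [f"g{i}" for i in range(20)]
expr = {g: rng.normal(size=30) for g in genes}
G = nx.gnm_random_graph(20, 40, seed=0)
G = nx.relabel_nodes(G, dict(enumerate(genes)))

def mean_abs_corr(graph):
    """Average |Pearson r| over expression profiles of linked gene pairs."""
    vals = [abs(np.corrcoef(expr[u], expr[v])[0, 1]) for u, v in graph.edges()]
    return float(np.mean(vals))

observed = mean_abs_corr(G)

# Null distribution: rewire the network while preserving the degree
# sequence, recomputing the statistic each time.
null = []
for i in range(1000):
    H = G.copy()
    nx.double_edge_swap(H, nswap=2 * H.number_of_edges(),
                        max_tries=10000, seed=i)
    null.append(mean_abs_corr(H))

# Empirical p-value: fraction of rewired networks whose average
# absolute correlation is at least the observed one.
p_value = (1 + sum(n >= observed for n in null)) / (1 + len(null))
print(f"observed = {observed:.3f}, empirical p = {p_value:.3f}")
```

With real co-expressed data the observed statistic would exceed most of the null values, giving a small p-value; with the random placeholder data above, the p-value is uninformative by construction.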

Discussion
Experimental measurement of protein-protein interactions is technically demanding and often different methods can give conflicting results [7,51]. Also, even a reliably measured interaction might not have a detectable biological function. To circumvent such challenges we use an expert curated interaction network of functional associations derived from numerous experiments [24].
We tested several recent machine learning methods for the task of PPI prediction. The classification models tested included pairwise kernels, output kernel trees, and supervised and semi-supervised input-output kernel regression. The methods differed in performance depending on whether ROC or PR was used as the evaluation metric. Semi-supervised IOKR proved to be the best compromise when both evaluation metrics were taken into account: it had the best PR performance and a reasonable ROC. This choice puts an emphasis on good performance in the positive class, which is required for reliable network reconstruction.
The multiple kernel learning methods tested included uniform kernel combination, methods based on centered kernel alignment, and the newly proposed p-norm path following algorithm. In our tests, we found that p-norm path following generally performed best in the ROC metric, while the other methods were close to each other in performance. In the PR metric, ALIGNF outperformed the other methods, with p-norm path following second best. Altogether, p-norm path following seems to give the best performance, although the improvements of MKL over the uniform kernel combination were smaller than expected.
To demonstrate our prediction approach we predict the protein secretion network for T. reesei, an industrially important protein production organism, which has no experimentally verified PPIs to date. Novel understanding of its protein secretion machinery could have a significant impact on the generation of improved protein production strains through targeted engineering.
For T. reesei we find that the predicted network is well supported by sequence-similarity-based manual annotation and by transcriptomics data. Most importantly, the predicted network includes 14 previously unknown genes that are taxonomically restricted to Pezizomycotina and hence could explain the exceptional protein secretion capabilities of these fungi.
Finally, we note that our setup does not need complex external database systems or specialized experimental data to be generated, but relies on data available through standard sequence searches, evaluated through fast machine learning models. Hence, our methods are amenable to local implementation as part of a genome annotation pipeline.
Supporting Information S1