A Network-based Approach for Predicting Missing Pathway Interactions

Embedded within large-scale protein interaction networks are signaling pathways that encode response cascades in the cell. Unfortunately, even for well-studied species like S. cerevisiae, only a fraction of all true protein interactions are known, which makes it difficult to reason about the exact flow of signals and the corresponding causal relations in the network. To help address this problem, we introduce a framework for predicting new interactions that aid connectivity between upstream proteins (sources) and downstream transcription factors (targets) of a particular pathway. Our algorithms attempt to globally minimize the distance between sources and targets by finding a small set of shortcut edges to add to the network. Unlike existing algorithms for predicting general protein interactions, by focusing on proteins involved in specific responses our approach homes-in on pathway-consistent interactions. We applied our method to extend pathways in osmotic stress response in yeast and identified several missing interactions, some of which are supported by published reports. We also performed experiments that support a novel interaction not previously reported. Our framework is general and may be applicable to edge prediction problems in other domains.

Case 4 [backwards edges from x → a, x → c, x → b, c → a, c → b, or a → b]: Similarly, it is easy to see that these edges cannot be used to decrease the distance between any pair to < 3.
Case 5 [combination of edges from b → x and b → c]: In the optimal solution, let there be s edges between b and some x's, and (k − s)-edges between b and some c's. Below, we argue that s = 0.
Each edge from b to x i yields d(b, x i ) = 1. Each of the (k − s) edges from b to c i yields d(b, x j ) = 2 for each x j linked to from c i (and where x j was not linked to directly). All remaining x's have distance 3 from b.
Thus, after k = q edges are added the APD equals: The first term in Equation (1) corresponds to (b, x i ) edges and contributes distance 1 for each x i . The second term corresponds to the (b, c k ) edges each of which yields a distance of 2 for up to three x's linked to from c k . The third term corresponds to the remaining x's whose distance was not reduced and which remains at 3. For the APD to equal 2, it must be that s = 0, which implies that there are exactly q edges from b to some c's in the optimal solution.
As a result, for a valid solution to exist, d(b, x i ) = 2 for all x i ∈ X . This is only possible if there exists a subset C ⊂ C, which contains exactly q elements and such that every x ∈ X is linked to from some c ∈ C . Because there are 3q total x's, it must be that each chosen c ∈ C links to a unique set of x's. Thus, C represent an exact cover of X .
Finally, the instance G in Figure S1A can be constructed in polynomial time from an instance of X3C defined by X and C. Hence, the Shortcuts problem is NP-complete.
Theorem 2. The Shortcuts-X problem is NP-hard for all r ≥ 2.
Proof. For brevity, we omit the proof but provide the instance used in the reduction from X3C ( Figure S1B). Each x i from the previous instance is now connected to a (r − 2)-long string of nodes. This is done essentially as a means to "pump up" the distance between the sources and targets such that each source-target pair will be exactly r hops away after k edges are added. A similar analysis follows.
Note that because the reduction instance used only one source, the Shortcuts-SS and Shortcuts-X-SS variants are also both NP-hard.

Predictions using the unoriented network
To show that the orientation step is indeed useful in extracting HOG paths given sources and targets, we ran each algorithm on the unoriented STRING PPI network ( Figure S2). We found that for both hop-restricted objective functions, the Greedy algorithm makes more HOG-relevant predictions when using the oriented network (53% vs. 46% for Shortcuts-X and 40% vs. 20% for Shortcuts-X-SS, compared to using the unoriented network). Moreover, the global methods (Short-Path and Jaccard) also benefited significantly from the orientation, which implies that defining network neighbors more precisely can help in identifying putative interactions.

Predictions using a hop-restriction length of 4
To explore the sensitivity of our results to the hop-restriction length, we repeated our computational experiments using a hop-restriction length of r = 4. Overall, we found similar qualitative performance for the algorithms when predicting from amongst all possible edges ( Figure S3). However, when predicting from amongst the potential set, we found only a few overlapping predictions with those made when the hop length was 5. Interestingly, these included the well-known HOG interactions Hog1→Cin5, Hog1→Msn2, and Hog1→Msn4, suggesting that the most confident and likely predictions are not wholly affected by the decreased hop restriction. Of course, some different predictions are also to be expected; for example, using a hop length of 4, the algorithm makes predictions for Sho1→Hog1 and Ste50→Hog1. While these predictions make sense algorithmically, they do not make sense biologically because they attempt to shortcut the sources of the pathway directly to a core node (Hog1). This suggests that 4 hops may be too restrictive and may motivate using a hop restriction of 5 in future efforts.

Predictions without using the STRING-derived edge weights
To understand how valuable the STRING-derived edge weights are to our approach, we used the Greedy algorithm to predict edges from the STRING potential set without considering the weight of the edge. (Each potential edge was given a default weight of 0.) In this test, 1 of the top 10 predicted edges overlapped with the predictions when using the STRING-derived weights (Pvalue = 1.5e −4 , Fisher's exact test), and this was true for both Shortcuts and Shortcuts-X. When using the weights, 3 of the top predictions for Shortcuts-X connected well-known interactors in the HOG pathway (Hog1→Msn2,Msn4,Cin5) yet none of these predictions emerged in the weight-less predictions. For Shortcuts, several interactions were predicted involving Hot1 (Hot1→Msn2,Msn4,Smp1), all of which have evidence for physical interaction in BioGRID, but none of which were made when using the weights. Thus, as expected, leveraging the weights leads to very different predictions and suggests that there may be multiple ways of connecting HOG sources to targets. Further, this result demonstrates that our objectives are well-defined and are able to uncover some missing edges with or without weights.

Microarray analysis and processing
The tpk2∆ strain was taken from the Yeast Deletion Library (BY4741 background). Cells were grown to logarithmic phase in SC medium (OD=0.5), washed, and resuspended in SC medium with 1M sorbitol for 30 minutes. The cells were harvested, pelleted, and frozen for further analysis. Total RNA was extracted using MasterPure TM yeast RNA purification kit (Epicentre). The samples were amplified, labeled, hybridized to custom Agilent microarrays (GEO platform GPL14666), and scanned using standard Agilent protocols, reagents, and instruments. Significance analysis of microarrays [2] was run using the two class unpaired response type, 1000 permutations, minimum twofold expression change, the delta that yielded a false discovery rate of at most 0.2, and de-fault values for all other settings. Genes for which both control measurements or both treatment measurements were missing were discarded.
The custom Agilent microarrays (GEO platform GPL14666) require normalization for a known probe distance bias. Probe distance, the distance between the probe and the 3' UTR end, was derived from reported 3' UTR lengths [3], and we used the average length for genes missing data. Gene expression was normalized by subtracting the Lowess fit of the probe distance and expression and then performing standard quantile normalization. Control probes were discarded, and probe expression was aggregated to the gene level by taking the median. Normalized expression data has been deposited in GEO with accession number GSE28213. Differentially expressed genes were compared to TF binding interactions [4] using the version of the dataset with a 0.005 P -value threshold and no motif conservation requirement.

Testing significance versus other perturbation assays
We used the Rosetta compendium [5] of 300 knockout (KO) expression experiments and compared the overlap between differentially expressed (DE) genes in each experiment with the list of Sok2 targets. DE genes in each experiment were identified as those with P -value < 0.005 (the same value was used as the Sok2 TF binding threshold). For each experiment, we computed the significance in overlap of the DE genes and the Sok2 targets using Fishers exact test. Of 301 experiments, only 31 (10.3%) had a lower P -value than the one obtained from our TPK2 KO.
It is not surprising that the deletion of other genes also leads to the differential expression of some Sok2 targets, but the fact that this occurs for only a fraction of experiments suggests that our KO holds against the statistical background. The physical interaction network provides evidence that, unlike the Tpk2→Sok2 interaction, the other KOs more significantly associated with Sok2 represent indirect relationships. None of the other corresponding protein products directly bind Sok2 according to STRING. Rather, they affect Sok2s bound genes via indirect cascades with an average of 3.5 protein interactions between the KO and Sok2 in the network. This suggests that simply looking at KO significance is not enough and that an approach like ours is necessary to find direct interactions.
In the other direction, we considered 117 additional TFs [4] and measured how many of them were affected by the TPK2 deletion. For each TF, we computed the overlap between the set of DE genes in our TPK2 KO and the set of TF targets and computed significance of the overlap using Fishers exact test. Similar as the test above, of the 118 tests only 14 (11.9%) had a lower P-value than our predicted Tpk2-Sok2 pair. Again, it is not surprising that other TFs were affected by the KO because deletions can affect both direct binding partners and proteins further downstream. The more significant Tpk2-TF associations do not correspond to direct binding in the interaction network -the average distance in the interaction network is 4.8 edges -which suggests that these are not candidates for missing interactions.