SANTA: Quantifying the Functional Content of Molecular Networks

Linking networks of molecular interactions to cellular functions and phenotypes is a key goal in systems biology. Here, we adapt concepts of spatial statistics to assess the functional content of molecular networks. Based on the guilt-by-association principle, our approach (called SANTA) quantifies the strength of association between a gene set and a network, and functionally annotates molecular networks like other enrichment methods annotate lists of genes. As a general association measure, SANTA can (i) functionally annotate experimentally derived networks using a collection of curated gene sets and (ii) annotate experimentally derived gene sets using a collection of curated networks, as well as (iii) prioritize genes for follow-up analyses. We exemplify the efficacy of SANTA in several case studies using the S. cerevisiae genetic interaction network and genome-wide RNAi screens in cancer cell lines. Our theory, simulations, and applications show that SANTA provides a principled statistical way to quantify the association between molecular networks and cellular functions and phenotypes. SANTA is available from http://bioconductor.org/packages/release/bioc/html/SANTA.html.


Introduction
High-throughput studies, like measuring differential expression in RNA-seq experiments or morphological changes in RNAi screens, produce genome-wide data that are often difficult to interpret. Functional annotation of hits in such studies relies mostly upon gene set analysis methods, which measure an association between hits and pre-defined gene sets [1,2] by quantifying overlap [3] or concurrent trends [2]. These approaches generally treat hits as independent; only very few exploit an internal structure [4]. Recently, many high-throughput studies have produced networks of physical [5,6] or genetic [7][8][9] interactions, rather than lists of hits. These networks can be even harder to interpret than lists of hits. While more and more networks are being generated, rigorous statistical methods to annotate their functional content are lacking, thereby making it difficult to identify and quantify any high-level changes. The need for rigorous functional analysis of networks becomes especially evident when trying to quantify the often subtle functional adaptions observed in networks specific to external stimulation [10,11], different phenotypes [9], cell types [12,13], or diseases [14,15].
Here, we develop methodology, called SANTA, for the rigorous and unbiased functional annotation of molecular networks. The basic input to SANTA are a network and a gene set and the output is the statistical significance of their association (Figure 1). Like Gene Set Enrichment Analysis [2], SANTA measures concordant changes in phenotype, but extends this concept to networks rather than lists of genes. Our work is directly motivated by a study describing the functional content of the S. cerevisiae genetic interaction network by Costanzo et al. [7]. The iconic first figure of this paper shows a network connecting genes with similar genetic interaction profiles and nodes highlighted according to their membership to functional groups defined by the Gene Ontology. Costanzo et al. find that genes displaying tightly correlated profiles form discernible clusters corresponding to distinct bioprocesses and that the relative distance between distinct clusters appears to reflect shared functionality [7]. This is an important observation, because it shows which cellular functions are associated with the genetic interaction network. If the network had been generated under different experimental conditions that activate different processes in the cell, these functional associations would most probably have changed. Using SANTA, it is possible to quantify the significance of these changes in functional association in a principled statistical way, something not previously possible.

SANTA rigorously implements an intuitive association measure
The roadmap for functional analysis of networks provided by Costanzo et al. [7] relies on assessing the clustering of selected nodes on the network. However, their analysis was done by eye and depends not only on the gene set and the network but also on the visualisation algorithm used. Clustering on a network is an intuitive measure of functional content, but without rigorous statistical methods the significance of observed patterns can not be assessed objectively. To address this problem, we have adapted well-tested concepts from spatial statistics [16] to define an objective and quantitative association measure between networks and gene sets. SANTA (spatial analysis of network associations) is built on the guilt-by-association principle: if a gene set shows a surprising degree of clustering on a network, we call them 'associated' ( Figure 2C); if the gene set is randomly distributed over the network, we call them 'not associated' ( Figure 2D).

Previous research
Enrichment analysis is a well-developed field and Khatri et al. [4] recently described three 'generations' of statistical methods, from over-representation analysis and functional class scoring, to pathway-topology based approaches. While methods in the last category [17][18][19] use pathway topology to compute gene-level statistics, none of them can be directly applied to measure the functional content of a network. Two related approaches, the compactness score of PathExpand [20] and EnrichNet [21], compare the clustering of two sets of genes on a network, but not the significance of the clustering of a single set. While the compactness score can be adapted to measure the significance of clustering, it focusses on the local structure of the network and can be less effective than SANTA to detect global effects, as we show below. Other approaches overlay interaction networks with genome-wide measurements and identify enriched subnetworks [22][23][24], to which enrichment analysis can then be applied in a consecutive step [25]. Again, subnetwork identification does not directly measure the association between gene sets and networks, and we show the effect of this difference in a comparison study.
In summary, no approach currently exists that, like SANTA, globally assesses the functional content of a network. In the following, we describe the methodology underlying SANTA and test its efficacy by applying it to both simulated and real data. Gene set enrichment methods have had a big impact on biological research and are used in almost every analysis of experimentally derived gene lists. The case studies we present in this paper show that SANTA has the potential to have a similar impact on all functional studies of network data.

Adapting Ripley's K-Function for networks
Spatial statistics model spatial correlation structures between observations (analogous to how time series analysis models the correlation between time points) [26]. One area of spatial statistics analyses point patterns and asks if points in R 2 are occurring at random or cluster together in any way. A basic tool for the analysis of point patterns is Ripley's K{function [16], which can be estimated by computing how many other points are captured by drawing a circle of radius s around each point: where n is the number of points, l is the density (number of points per unit area), d(i, j) is the distance between two points in R 2 , and I(d(i, j)vs) is an indicator function with value 1 if the distance d(i, j) between points i and j is smaller than s, and 0 otherwise. If the points are densely clustered, most of them will be Figure 1. Overview of SANTA. SANTA can be used both to quantify the strength of association between networks and sets of node weights (using K net ) and to prioritise genes for follow-up analyses (using K node ). Different node colour intensities represent different node weights. doi:10.1371/journal.pcbi.1003808.g001

Author Summary
Molecular networks are maps of the tens of thousands of interactions that occur between the components of biological systems. Types of interactions include physical, genetic and functional interactions between genes, gene products and metabolites. Network-based approaches to molecular biology are increasingly being used to better understand cellular functions. Currently, gene set methods can be used to functionally annotate the hits from highthroughput studies; however, no methods exist to functionally annotate molecular interaction networks. This greatly limits our ability to quantify the often subtle functional adaptions that occur in networks as they rewire to respond to external stimuli. Here, we extend well-tested concepts from spatial statistics to define a general association measure between networks and gene sets. Like Gene Set Enrichment Analysis, our approach measures concordant changes, but does this on networks, rather than on lists of genes. We validate it both in simulations and real-world case studies. We apply our approach to genetic interaction networks mapped under different conditions and created using different methods, and demonstrate how it extends the previous analyses of data sets, allowing us to better understand the high-level changes that occur within cells.
found for small values of s, while for uniformly spread points the K{function achieves larger values only for large values of s. Applications of the K{function in computational biology include detecting host factors involved in virus infection by observing the clustering of infected cells in siRNA screening images [27].
In our scenario, we measure observations at fixed locations (the nodes in the graph) instead of random locations in the plane. However, we can still adapt basic spatial statistics methodology: In the following paragraphs we will define a K{function for weighted networks, called K net , by first defining distance measures on graphs (instead of R 2 ) and then weighting each node by the strength of its phenotype. To adapt K(s) to networks, we formalised the problem using a node-labeled and edge-weighted graph G~(V ,E), where V is a set of n nodes (vertices) and E(V |V a set of m (undirected) edges between pairs of nodes (i,j). Node Graph distances. Distances between non-neighbouring nodes in this graph can be measured in many ways, including shortest path lengths, diffusion kernels [28] and the mean first-passage time [29], which are all implemented in the SANTA software package. There are subtle differences in the aspects of the network structure incorporated within each measure. For example, a shortest-path approach will only take into consideration the one path with the shortest length, no matter how many other paths exist between two nodes. A diffusion kernel, on the other hand, takes into account all paths and will yield a smaller distance the better connected two nodes are. The results produced by SANTA are generally robust across distance measures ( Figure S3), meaning that it often does not matter which method is chosen by the user. The shortest path distance method requires the least computational time and therefore we will mainly use this method in the paper. Efficient algorithms like Dijkstra's or Johnson's exist to compute shortest paths between all pairs of nodes [30] and are conveniently implemented in software packages like 'igraph' [31]. However, the diffusion-kernel based distance measure is used to identify enriched subnetworks, as this method is seen to produce denser subnetworks.
Many of the graph distance algorithms assume that small edgeweights correspond to stronger functional association between the two nodes. Many networks, however, are built by correlation analysis, where stronger functional associations are shown by a larger weight. Thus, in practice, the edge weights in a given molecular network often need to be reweighed to be used as graph distances fd e D e[Eg. Due to differences between the methods used to create each molecular network, it is necessary to use a different approach when reweighing the edges of each network. Edges are reweighed so that the strongest interactions have a graph distance of 0 and the weaker the interaction the greater the distance (see Methods).
Node weights. Exchanging the planar distance d( : , : ) in Equation (1) with a graph distance d g ( : , : ) directly results in a version of the K{function that is applicable if the node weights are in f0,1g. However, in many real situations, e.g. differential expression analysis or large-scale RNAi screens, the node weights are real numbers. In this case it is not only of interest how many 'hits' are close to each node, but also how strong these hits are. We implement this notion by weighting the contribution of each node by the relative weight it carries compared to the other nodes. This results in a function K n e t of the form where p i is the phenotype observed at node i and p p~1 n X n i~1 p i . This re-weighting is very similar in spirit to the re-weighting of the Kolmogorov-Smirnov statistic in GSEA [2]. Generally, we plot K n e t from 0 to the maximal distance within the graph (the diameter), in which case K n e t forms a curve starting and ending at 0 ( Figure S1).
Node-wise K-function. The inner sum of Equation (2) offers a natural way to prioritise nodes and identify candidate genes for the mechanisms underlying the observed phenotype. We define this node-wise K{score as the AUK of the node-wise K{function, defined as: Computing significance by permutation. To check how significant observed K n e t results are, we compare them to curves obtained by applying K n e t to sets of randomly permuted hits. These sets of permuted hits are obtained by randomly redistributing the node weights across the nodes. When permuting the node weights, it is not always possible to maintain the degree of each node, therefore, node degree is not considered when permuting the weights.
Since we want to quantify the amount of clustering, we are interested in observed K n e t -curves that ascend steeper than random curves. To quantify this, we compute for all curves (the observed K n e t and the N p e r m random permutations) the area under the K net -curve, or AUK value. An empirical p-value for the observed AUK is calculated using a Z-test. Figure 2 exemplifies the application of K n e t to two GO terms and the yeast genetic interaction network.

Simulation studies
SANTA successfully identifies clustering on simulated networks. Functional annotation is an exploratory task without a general gold standard. In order to test the ability of SANTA to correctly identify clustered distributions of node weights on networks, we conducted a number of controlled simulations. In each of these simulations, we created a network containing a cluster of high weight vertices of a known strength and applied the K n e t -function in order to determine whether it would successfully identify the clustering.
Each of the networks contained 500 nodes and was created using the Barabasi-Albert model of preferential attachment [32]. A seed node was chosen at random. All nodes in the network were ranked by their distance (using the shortest paths method) to the seed node and the s closest nodes chosen to be the sample set. A hit set was then created by choosing 5 nodes at random from the sample set. Different values of s (10, 20, 50, 100 and 500) were chosen to simulate different clustering strengths. A value of s equal to the number of nodes in the network is the same as randomly sampling nodes from the entire network.
As expected, SANTA identified more significant clustering when applied to hit sets created with smaller values of s ( Figure 3A). When nodes are randomly sampled from the entire network, the p-values returned by SANTA were uniformly distributed ( Figure 3B), as expected when the null hypothesis is true.
SANTA incorporates the global structure of a network for functional association. One of the main advantages of the K n e t -function is that it considers the global topology of a network when measuring the significance of clustering. This can be demonstrated by comparing the K n e t -function to an adapted version of the compactness score [20]. The compactness score of a gene set is the mean distance between pairs of nodes in the gene set. It is used by the PathExpand tool to compare the clustering strength of different sets of nodes [20]. By comparing the compactness score of an observed set of nodes to the compactness scores of permuted sets of nodes, it is possible to produce an empirical p-value describing clustering significance, much like the K n e t -function.
Many real-world networks follow a power-law degree distribution and contain nodes with both a small and large number of interacting partners [32]. If the genes in a gene set all have a large number of interacting partners, then the presence of interactions between the genes in the gene set could be considered less significant, as there is a greater likelihood that they would be observed by chance. Therefore, it is necessary to incorporate the global structure of the network and consider the number of nodes located near each node when quantifying clustering significance. Figure 4 demonstrates that while the K net -function incorporates the global structure of the network, the compactness score does not. The K net -function can also be applied to continuous distributions of node weights, while the compactness score can only be applied to binary sets, limiting its application. For these reasons, the K net -function is better suited to measuring the significance of clustering of node weights on real-world networks.
SANTA provides a complementary method of identifying enriched subnetworks. Next, we compared SANTA to approaches that overlay molecular networks with additional node information and identify a high-scoring subnetwork, using simulated and real data. A widely used example is BioNet [24], which identifies enriched subnetworks of nodes by fitting a betauniform mixture (BUM) model to the network in order to score nodes. Positive-scoring nodes are then aggregated and a minimum spanning tree calculated between these positive nodes. However, the presence of negative-scoring nodes between clusters can prevent BioNet from identifying multiple clusters. As the K n o d efunction considers each node individually, it is able to return highscoring nodes spread across multiple clusters. . Application of K net to simulated networks. Scale-free networks containing clusters of high-weight nodes of various strengths were generated. The smaller the distance cutoff used to generate the cluster, the greater the strength of the clustering. (A) 1000 trials were completed for each distance cutoff. As expected, the most significant clustering was measured by the K net -function when smaller distance cutoffs were used. (B) Q-Q plot of the p-values observed in the simulation study trials in which no distance cutoff was used and the pvalues expected under the uniform distribution. The high-weight nodes were distributed homogeneously when no distance cutoff was used. The observed p-values deviate little from the expected p-values, demonstrating that the K net -function does not detect clustering when clustering is not present. doi:10.1371/journal.pcbi.1003808.g003 We conducted a number of simulations in order to compare the abilities of K n o d e and BioNet to identify high-scoring nodes located within multiple clusters on a network. In each simulation, a network containing 1000 nodes was created using the Barabasi-Albert model of preferential attachment [32]. 2, 3 or 4 nodes from distant parts of the network were selected to seed the clusters. For each seed node, 10 nodes were selected at random under a probability distribution that ensured that the probability of being chosen decreased exponentially with the distance from the seed node (P(i)*10 {d g (k,i) , where k is the seed node). The selected nodes became the high-weight nodes and were assigned node weights from a truncated normal distribution with a mean of 0 and a standard deviation of 1|10 {6 , within the interval ½0,1. Unselected nodes were assigned node weights from the uniform distribution, again within the interval ½0,1. K n o d e and BioNet were then applied to the network. If x high-weight nodes are applied to the network, K n o d e is said to have successfully identified a high-weight node if it is ranked within the top x nodes. BioNet successfully identifies a high-weight node if it is contained within the returned enriched subnetwork. Figure 5 shows that K n o d e was able to successfully identify a greater proportion of labelled nodes than BioNet when 3 or more clusters were added to the network. BioNet tended to successfully identify nodes from a single cluster, but missed nodes contained within others. This highlights an advantage of SANTA over methods identifying a single top scoring subnetwork.

Real-world case studies
SANTA identifies functionally-informative enriched subnetworks. We also compared the K n o d e -function to BioNet by rerunning the validation experiment conducted by BioNet. Gene expression data from two subtypes of diffuse large B-cell lymphomas (DLBCL) was combined with survival data [33]. P-values were produced using Cox regression and these were converted into node weights which were used to annotate the HPRD interaction network [34]. BioNet and the K n o d e -function were applied in order to identify enriched subnetworks. BioNet returned a module containing 38 genes and 49 interactions. In order to make a fair comparison, the 38 genes ranked highest by K n o d e were chosen to form the K n o d e module. This module is denser than the BioNet network and contains 86 interactions. Only 7 genes were identified by both BioNet and K n o d e . The BioNet module is enriched with genes involved in the oncogenic NFkB pathway [24]. Fisher's exact test was used to identify functional gene sets enriched within the modules [35]. While the K n o d e module is not enriched with NFkB pathway genes, the cancer-associated GO term 'regulation of apoptosis' was identified as the most stronglyenriched gene set (pv1|10 {7 ).
These results demonstrate that the K n o d e function represents a complementary method of enriched subnetwork identification. However, the main purpose we envision for SANTA is to annotate the functional content of networks and the next case studies focus on this task.
Correlations in GI profile produce functionally more informative networks. For further validation, we applied SANTA to the global genetic interaction (GI) network of S. cerevisiae, where there is evidence that protein function is more closely related to the global similarity between GI profiles than to individual interactions [7]. To measure this effect we contrasted the functional content of a network of high correlations between GI profiles with a network of individual GIs. This was done by quantifying the strength of association of sets of functionally related genes with each of the networks using the K n e t -function. Sets of functionally related genes were obtained from the Gene Ontology (GO). To ensure that the functional sets were not too thinly or thickly spread, only GO terms associated with between 20 and 100 network genes were tested. Figure 6A shows that GO terms indeed tend to cluster more strongly on the correlation network than on the network of individual GIs, demonstrating that similarity between GI profiles is a stronger indication of shared protein function. This effect was independent of the GO term size and strongest for specific cellular functions like 'structural constituent of ribosome', 'cytosolic small ribosomal subunit' and 'piecemeal microautophagy of nucleus' (Table S1).
Yeast interaction networks functionally rewire under external stress. Most studies have mapped GIs in cells under normal laboratory conditions [7,8,36]. However, it has been demonstrated that GIs can be condition-dependant [37]. Mapping GI networks under multiple conditions is therefore likely to reveal new information about how a cell reorganises itself to cope with environmental conditions. To measure these effects, we used SANTA to analyse the changes in functional content that occur in S. cerevisiae GI networks under external perturbation by the DNA-damaging agent methyl methane-sulfonate (MMS) [10] and UV radiation [38]. We again used the association strength of GO  term-associated gene sets to quantify functional enrichment within each network. By comparing the association strengths of the GO terms between the treated and untreated networks, it is possible to identify pathways and processes that are up-and down-regulated in response to the changes in environmental condition. GO terms were applied to each network if they associated with between 20 and 100 genes.
We found several GO terms that associated more strongly with the MMS-treated network than the untreated network ( Figure 6B and Table S2). GO terms related to the response to DNA-damage, including 'DNA repair', 'response to DNA damage stimulus' and 'covalent chromatin modification', associated more strongly with the MMS-treated network. This result is expected and found in the original publication, thereby providing further validation for SANTA.
Comparing the functional enrichment of the UV-treated network replicates the finding of the original publication as well as identifying subtle changes not reported in the publication ( Figure 6C and Table S3) [38]. The top 10 GO terms most strongly enriched within the UV-treated network are related to DNA-damage repair or cell cycle progression; processes known to be affected by exposure to UV radiation [39]. However, the K n e tfunction is also able to identify processes affected by UV-treatment not reported in the original publication. 'Chromatin silencing at telomere' associates more strongly with the untreated network (pv1:6|10 {8 ) than the treated network (pv3:4|10 {5 ). It has previously been demonstrated that some of the proteins involved in transcriptional silencing at the telomeres, such as Sir and Ku, are also involved in DNA-damage repair [40] and are dispersed from the telomeres in response to DNA damage [41]. Our results provide further support for this hypothesis and demonstrate that the K n e t -function is able to provide insight into the functional repurposing of cells that cannot be provided by current methods.
The strength of gene set association was independent of gene set size ( Figure S2). Association strength is also robust across distance methods ( Figure S3). SANTA identifies functional adaptions not seen in the original analysis and thereby also provides a method of hypothesis generation. The advantage of SANTA is that it directly contrasts the functional content of the two networks, which improves on the indirect enrichment analysis of differing edges in the original analysis [10].
Interaction networks provide different levels of information about cancer cell line maintenance. Different networks describe different aspects of cellular machinery: co-expression networks describe transcriptional effects, protein interaction networks describe complexes and genetic interaction networks describe epistatic buffering relationships. Identifying the type of network that associates most with genes of interest can point to the mechanism underlying observed phenotypes. To exemplify this idea, we used SANTA to associate RNAi screens in cancer cell lines [42] to a curated network of physical interactions [43] and to a functional interaction network created by combining 21 data sources from 4 species [44], with the aim of identifying the network that best explains the phenotype. The colon and ovarian cancer cell line RNAi hits were seen to associate more strongly with the functional interaction network (Figure 7), indicating that it is possible to create a network that better explains the mechanisms that maintain cancer cell line viability by combining multiple data sources.

Discussion
SANTA is a general approach for functional annotation that extends enrichment analysis from gene sets to networks. SANTA combines the guilt-by-association principle, which is one of the most powerful paradigms for function prediction, with well-tested concepts adapted from spatial statistics. In this way, SANTA provides a rigorous implementation of an intuitive measure of functional annotation. We have applied SANTA to several datasets from different organisms and our results show how SANTA rigorously addresses the basic question of which functional processes are reflected in a network.
In yeast, our results on genetic interactions support the idea that a strong correlation of GI profiles between two genes is a greater indicator of shared function than the presence of a single GI. The reason for this increase in functional information is most probably that individual GIs don't bear much evidence for underlying mechanisms, while having many GI partners in common is strong evidence for genes acting in the same pathway or complex [45]. Additionally, Costanzo et al. [7] noted that their network captured only 35% of the previously reported interactions, indicating that a large number of false positives and false negatives may be present within GI networks. Networks of correlations in GI profile may be more robust to the high number of errors that are present when GIs are mapped.
Extending these results to networks rewired under external stimulation, we show how SANTA quantifies subtle functional changes. In humans, we showed how SANTA can contribute to understanding the mechanisms underlying large RNAi screens. Testing the association of hits with many different networks (transcriptional, proteomic, genetic) can help us to understand which cellular mechanisms underly the phenotypes. In summary, our results support that SANTA accurately quantifies the functional content of networks, points to mechanisms underlying observed phenotypes, and provides a natural way to compare functional changes across networks.
We expect SANTA to contribute mostly to the functional annotation of networks derived under different environmental conditions (like the GI networks we used as case studies here). However, SANTA is a very general approach and the examples we presented here also show other uses: it can also be used to annotate RNAi hits (if different functional networks are available) and prioritise individual hits over others (using K n o d e ). In the future, we see many further opportunities for applying SANTA. For example, new methods of automated, single-cell phenotyping measure genetic interaction networks across a broad spectrum of phenotypes [9] and a functional annotation method like SANTA could have great impact on understanding which cellular processes are reflected in which phenotype. Another potential application for SANTA lies in network-based medicine, where drug development for complex diseases is developing towards targeting dynamic network states [46][47][48] and network-based analysis has identified cancer subtypes [49]. Functional annotation of these networks will further our understanding of the biology underlying these diseases.
Gene set enrichment analysis is the first step in the unbiased analysis of most experimentally derived gene lists and we expect SANTA to have a similar impact on all functional studies of network data.

Shortest paths distance measure
There are a number of different methods available to calculate the distance between a pair of nodes in a network. One of the simplest methods involves identifying the shortest path connecting the node pair and using the length of this path. The shortest paths distance measure can be applied to networks with or without weighted edges. In unweighted networks, the shortest path is equal to the number of edges included within the shortest path. In weighted networks, it is the sum of the edge weights along the shortest path.
A number of different algorithms are available to compute the shortest path between two nodes. Which algorithm is most efficient depends on the type of network being analysed. If no edge weights are present, then the breadth-first search algorithm is ideal. If edge weights are present and each edge weight is non-negative, then Dijkstra's algorithm is more efficient [30].

Diffusion kernel-based distance measure
The diffusion kernel-based distance measure is another method of calculating distances between pairs of nodes [28]. An advantage of the diffusion kernel-based method over the shortest-paths method is that whilst the shortest-paths method calculates the distance along a single path, the diffusion kernelbased method incorporates distances along multiple paths. Like the shortest paths method, the diffusion kernel-based method can also be applied to networks with or without edge weights. One interpretation of the method is the continuous time limit of a random walk across the network, resulting in highly-connected nodes being associated with smaller node pair distances.
The negative graph Laplacian (H) is used to create a diffusion kernel for the network. H is a square matrix of size V |V with entries: H is specified for networks with and without edge weights. i*j indicates that node i and node j are connected by an edge and i=j indicates that they are not directly connected. d i is the number of edges associated with node i (the degree of node i). w ij is the weight of the edge connecting nodes i and j. The bH The fact that H is diagonalizable (H~UDU {1 ) makes it easier to compute D. If D is a diagonal matrix with entries (d i ) i~1,...,n , then D~exp (bH)~U exp (bD)U {1 . exp (bD) is a diagonal matrix with entries ( exp (bd i )) i~1,...,n [28].
Mean first-passage time-based distance measure Mean first-passage time (MFPT) can also be used to compute the distances between pairs of nodes [29]. The MFPT-based measure is similar to the diffusion kernel-based measure in that it can be compared to completing a random walk across the network. The MFPT of a walk from node i to node j (m i,j ) represents the expected number of steps required to reach node j for the first time: i, j is the probability that the random walk reaches node j for the first time after n steps. The MFPT between each node pair can be computed analytically using the equations: where I is the identity matrix, E is a matrix with equal dimensions containing only 1s, e is a column vector containing only 1s, p is a column vector of the stationary distributions of the Markov chain, A is the Markov chain transition matrix and D is a diagonal matrix with elements:

Costanzo et al. yeast GI networks
Costanzo et al. tested for genetic interactions (GI) between 5.4 million gene pairs in S. cerevisiae using synthetic genetic array (SGA) analysis [7]. Using this data, we created two interaction networks: the first from raw GI scores (E) and the second from correlations in interaction profile. The raw interaction network contains both positive (Ew0:16) and negative (Ev{0:12) interactions (78,701 interactions between 4,326 genes). GI scores were converted into edge distances by calculating: The correlation network was created by computing, for each gene pair, Pearson's correlation coefficient for the respective rows of the complete GI matrix. Pairs of genes were connected in the network if their interaction profile correlation coefficient exceeded a threshold. Using a threshold of PCCw0:125 ensured that the correlation network contained a similar number of interactions to the raw network (76,434 interactions between 4,326 genes). Correlation coefficients (c e ) were converted into edge distances by calculating: Bandyopadhyay et al. yeast GI networks 174,000 gene pairs were tested for interactions in MMS-treated and untreated S. cerevisiae [10]. Modified T-tests were used to compare the growth rate of the observed double mutant against the rate expected given that no interaction exists. We previously demonstrated that a strong correlation in GI profiles is a greater indicator of shared function than raw interactions. Therefore, we created a correlation network for each condition by computing Pearson's correlation coefficient for each gene pair. A threshold of PCCw0:3 for the MMS-treated network and PCCw0:25942 for the untreated network was applied to ensure that each network contained an equal number of interactions (3067 interactions between 419 genes). Correlation coefficients were converted into edge distances using Equation 12. 45,000 gene pairs were tested for interactions in S. cerevisiae treated with high doses of UV radiation (80J=m 2 ) and untreated S. cerevisiae. Modified T-tests were used to produce interaction scores (S) for each of the gene pairs. Too few gene pairs were tested to build a GI correlation network and therefore networks of raw interactions were created. Pairs of genes were connected in the networks if Sw1:25 or Sv{1:25. The UV-treated network contains 5,799 interactions between 1,406 genes and the untreated network contains 6,270 interactions between 1,406 genes. Interaction scores were converted into edge distances by calculating:

IntAct physical and genetic interaction network
IntAct is an open source database for molecular interaction data [43]. H. sapiens data from the database was downloaded on 2013-05-02 to create the biological network used in Figure 7. This network contains 6,856 genes and 21,291 interactions. No confidence scores were available for the interactions and therefore no edge distances are associated with the network.

HPRD physical interaction network
The Human Protein Reference Database is a database of physical and functional interactions between genes and proteins [34]. The HPRD network was downloaded from the R package DLBCL, version 1.3.7 [50]. To allow for comparison of the K n o d e function to BioNet, only the largest cluster of interacting genes was used. The final HPRD network contains 7,756 interactions between 2,034 genes.

HumanNet functional interaction network
HumanNet is a functional network that combines 21 sources of genomic and proteomic data from four species to build a human-specific biological network [44]. These sources of data include gene co-citation, gene co-expression, curated physical and genetic interactions, high-throughput physical and genetic interactions, co-occurrence of protein domains and bacterial orthologs from C. elegans, D. melanogaster, H. sapiens and S. cerevisiae. Version 1 of the database was used to create the biological network used in Figure 7. Log likelihood scores were provided for each of the interactions. To reduce the density of the network, interactions with log likelihood scores less than 2 were removed from the network. This network contains 8,475 genes and 58,636 interactions. Log likelihood scores LLS e were converted into edge distances by calculating: Cancer cell line RNAi hits RNAi technology can be used to identify genes essential to the survival of cancer cell lines. Cheung et al. performed genome-wide RNAi screens of 102 cell lines across 6 cancer types: oesophageal squamous cell carcinoma, glioblastoma (GBM), non-small-cell-lung cancer (NSCLC), pancreatic cancer, ovarian cancer and colon cancer [42]. 11,194 genes were targeted. The weight of evidence approach was used to compute essentiality scores for each shRNA for each set of cancer cell lines [42]. GENE-E (http://www.broadinstitute.org/ cancer/software/GENE-E/index.html) was used to collapse the shRNA-wise essentiality scores into gene-wise p-values. P-values are produced by permuting the shRNA scores 10,000 times in order to create artificial genes. The second best score of the shRNA within these artificial genes is then compared to the second best observed shRNA score. Gene-wise p-values s v were converted into node weights p v by calculating:

DLBCL gene expression and survival data
Gene expression data for two subtypes of diffuse large B-cell lymphomas (DLBCL): germinal center B-like phenotype (GCB, 112 tumors) and activated B-like phenotype (ABC, 82 tumors), was obtained from the R package DLBCL, version 1.3.7 [50]. This package also contains data on patient survival. The data originally comes from a study of patient survival after chemotherapy [33]. Pvalues for differential expression and risk association were produced using Cox regression. These p-values were combined using second-order statistics in order to produce gene-wise association scores which could be applied to the networks. Gene-wise p-values were converted into node weights using Equation 15.

Gene Ontology database
The Gene Ontology (GO) database consists of a hierarchical structure of gene annotations [1]. Annotations from this database were used in Figure 6. The GO database consists of 3 top-level ontologies: molecular functions, biological processes and cellular components, all of which were used in each figure. S. cerevisiae GO term annotations were retrieved from the Saccharomyces Genome Database (www.yeastgenome.org) using the R package org.Sc.sgd.db, version 2.10.1 [51].

Compactness score
The compactness score C is defined as the mean shortest path distance between pairs of nodes in a set P on graph G [20].
In order to measure the significance of the observed compactness score, we compared it to scores produced using sets of randomly permuted hits. An empirical p-value for the observed compactness score is calculated using a Z-test.

Implementation
The methodology described in this work has been assembled as an R package called SANTA, which is available for download at http:// bioconductor.org/packages/release/bioc/html/SANTA.html. This package is distributed with the code (in the form of a vignette, Text S1) and the data required to reproduce all of the results given in this paper. The vignette also contains the parameters used with the Barabasi-Albert model of preferential attachment to create the simulated networks. The running time of SANTA depends on the size of the network and the number of permutations being run. Using 1000 permutations, SANTA requires 1GB of RAM and 25 seconds on a single Intel Xeon E5-2640 to measure the strength of association of a single gene set on the raw Costanzo et al. GI network (78,701 interactions between 4,326 genes). SANTA can use parallel computing (where available) to reduce running time. Figure S1 Comparison of Ripley's K-function and K n e t . (A) Ripley's K-function (K(s)) counts how many points on a plane are captured within circles of increasing radius (s) around each point. Here, circles are drawn from only a single point (red circle). (B) The graph of K(s) for the distribution of points in (A). If the clustering of points were greater, then the K(s) function would increase faster and the area under the curve (AUK) would be greater. (C) The K n e t -function computes the absolute deviation of the sum of the weight of nodes within a certain distance of each node from the Null model. The distance from a single node (red circle) is shown. The darker the colour of the node, the greater its weight. (D) The graph of the K n e t -function for the network and node weights in (C). The greater the clustering of the node weights on the network, the greater the AUK. (TIF) Figure S2 Correlation between set size and K n e t pvalue. Plot of the significance of the clustering of sets of network genes associated with a GO term against the set size on GI networks mapped in (A) untreated yeast and (B) yeast treated with the DNA-damaging agent MMS. Only those GO terms that associate with either or both networks with a strength of pv0:001 are shown. Many GO terms share a large number of genes due to their ontological relationship. When those GO terms that are ancestors of other GO terms tested are removed, Pearson's correlation coefficient equals 0.004 for the treated network and 2 0.040 for the untreated network, demonstrating that there is little correlation between set size and K n e t p-value. (TIF) Figure S3 Correlation in network-gene set association strength between distance methods. Pair-wise comparison of the association strengths of GO terms across the three distance methods. The networks tested were the MMS-treated (Top) and untreated (Bottom) S. cerevisiae GI networks created using data from Bandyopadhyay et al. Association strength correlation across networks is very high (PCCw0:98), demonstrating that the results produced by SANTA are generally robust across distance methods. (TIF)

Supporting Information
Table S1 GO terms differentially associated with a network of raw GIs and GI profile correlations. K n e t was used to test the strength of association between sets of genes associated with various GO terms and the two network types. This table contains the GO terms that associated most strongly (pv1|10 {8 ) with one or both of the networks. GO terms are ranked by their differential association strength (D), with the terms associated more strongly with the network of GI profile correlations positioned towards the top and the terms associated more strongly with the network of raw GIs positioned towards the bottom. A greater number of GO term genes associated more strongly with the network of GI profile correlations. (PDF) Table S2 GO terms differentially associated with the untreated and MMS-treated GI networks. K n e t was used to test the strength of association between sets of genes associated with various GO terms and the two network types. The table contains the GO terms that associated most strongly (pv0:001) with one or both of the networks. GO terms are ranked by their differential association strength (D), with the terms associated more strongly with the treated network positioned towards the top and the terms associated more strongly with the untreated network positions towards the bottom. (PDF) Table S3 GO terms differentially associated with the untreated and UV-treated GI networks. K n e t was used to test the strength of association between sets of genes associated with various GO terms and the two network types. The table contains the GO terms that associated most strongly (pv0:001) with one or both of the networks. GO terms are ranked by their differential association strength (D), with the terms associated more strongly with the treated network positioned towards the top and the terms associated more strongly with the untreated network positions towards the bottom. (PDF) Text S1 Vignette containing details of how to reproduce the results given in this paper.

(PDF)
Author Contributions