Passing Messages between Biological Networks to Refine Predicted Interactions

Regulatory network reconstruction is a fundamental problem in computational biology. There are significant limitations to such reconstruction using individual datasets, and increasingly people attempt to construct networks using multiple, independent datasets obtained from complementary sources, but methods for this integration are lacking. We developed PANDA (Passing Attributes between Networks for Data Assimilation), a message-passing model using multiple sources of information to predict regulatory relationships, and used it to integrate protein-protein interaction, gene expression, and sequence motif data to reconstruct genome-wide, condition-specific regulatory networks in yeast as a model. The resulting networks were not only more accurate than those produced using individual data sets and other existing methods, but they also captured information regarding specific biological mechanisms and pathways that were missed using other methodologies. PANDA is scalable to higher eukaryotes, applicable to specific tissue or cell type data and conceptually generalizable to include a variety of regulatory, interaction, expression, and other genome-scale data. An implementation of the PANDA algorithm is available at www.sourceforge.net/projects/panda-net.


Final Network Validation:
To validate our predicted regulatory network, we downloaded ChIP-chip experiment data, that provide genome-wide information of the in vivo binding sites of the TFs [1]. TF targets were defined using the criterion p<0.001. We used this information to construct a "gold-standard" for our regulatory network as follows: 0  10  with  gene  of  region  control  the  to  binds  TF  if  1 3 ) ( (2) When evaluating the quality of a network against this standard we only used TFs and genes included in the standard. In other words, when calculating the AUC we excluded edges connected to TFs for which the rows of W (G) sum to zero and edges connected to genes for which the columns of W (G) sum to zero.
where x jl is the expression level of gene j in condition l and . j x is the average expression level of gene j across all conditions. This created three initial co-regulatory networks, one for each expression dataset. Note that a gene is always co-expressed with itself, therefore C jj =1. In the rare absence of expression data, this assignment can be used to initialize C to an identity matrix. It also allows us to account for genes in our model that are the only target of a TF and thus are only co-regulated with themselves and not other genes.

Final Network Validation:
We used experimentally-derived binding sites from ChIP-chip [1] to construct a "gold-standard" to evaluate our co-regulatory network. For every gene-pair, we determined if those two genes were ever targeted by the same transcription factor in the ChIP data and defined the network as follows: 0   10  with  and  genes  both  of  regions  control  the  to  binds  TF  any  if  1 (4) When evaluating the quality of a network against this gold standard we only used genes with information in the standard. In other words, when calculating the AUC we excluded edges connected to genes for which the rows of C (G) sum to zero.

Protein-Protein Interaction Data (Protein-Cooperativity Network)
Data: We downloaded protein-protein interactions from NCBI [10]. This repository includes interactions reported in the BioGrid database [11,12].

Initial Network Construction:
We filtered the data set to include only physical interactions between the 53 TFs in the motif network (see above) with evidence from high-throughput Affinity Capture-MS experiments, and defined the initial network as: We also set P ii =1, since a TF can be viewed as co-operating with itself. In the absence of protein-protein interaction data, P is initialized to the identity matrix.
Final Network Validation: Separately, we filtered the protein-protein interactions listed in the above database to include only physical interactions between the 53 TFs in the motif network with evidence codes from "lowthroughput" (and more stringent) experiments, including "co-fractionation," "co-localization," "FRET," and "reconstituted complex." These interactions define our "gold-standard": As with the regulatory and co-regulatory networks, the quality of the final networks was only evaluated using edges connected to TFs for which we had information in our gold-standard, thereby excluding from the evaluation any edges connected to TFs for which the rows of P (G) sum to zero.

Initial Network Normalization
In order to pass information between data types that are intrinsically different, C (0) , W (0) and P (0) are all normalized by converting to Z-scores, as follows: where ) 0 ( pq X represent the entry in the p th row and q th column of data type X. For example if X=W then, for an edge W ij in W: and furthermore, where μ i and σ i represent the mean and standard deviation across row i in W. This transformation is integrated into the PANDA algorithm and occurs only once for each data typeimmediately after network initialization and before the first message is passed.

Finding Agreement Between Two Networks
Before discussing the specifics of the PANDA algorithm, we wish to offer some conceptual motivation for the mathematical approach. PANDA aims to find agreement between the data represented by two networks. In other words, we give more weight to an edge if there is a large amount of agreement between data representing that edge, and subtract weight from an edge if there is a large amount of disagreement between data representing that edge.
With this in mind, we use a similarity metric in order to evaluate the potential weight of an edge in each of our networks. We remind the reader that in order to put the networks representing each of our datasets on the same scale, we normalized each by recasting edge weights into z-score units (see Equations 7-9). As a result, we need our metric to take two vectors representing z-scored edge weights and return a score whose value is also in z-score units, or at least whose value can be interpreted in much the same way as the values of the input vectors.
We base the similarity score used by PANDA on the Tanimoto similarity: This metric will return a value of one for perfect agreement and a value close to zero for no agreement. We modify the above equation in two simple ways such that, given x  and y  in z-score units, the similarity value returned will also mimic z-score units. First, we force the distribution to be symmetric around 0 (in other words, for y y ) by the addition of absolute value signs around the final dot product in the denominator. Then we add a square root around the entire denominator such that the returned values will no longer be strictly bounded between [-1,1] but may take any real values. Namely: One can immediately observe that, given x  and y  in units of z-score, the numerator is in units of z-scoredsquared and the denominator is in units of z-score, thus T Z will be in units of z-score (instead of unit-less as in Equation 10). We also note that the maximum and minimum values that can be obtained when z-scoring a vector of values are equal to 1  N and 1   N , respectively, and occur when all entries of a vector except one have the same value. We would like T Z to return these maximum and minimum values when x and y either perfectly agree or perfectly disagree. We point out that, because variance of a vector in units of z-score is equal to one: Using this identity, we observe that for , which approximates these maximum and minimum values. Of course, because we simultaneously z-score each row and column of the initial networks (see Equations 7-9), the values are only approximately in z-score units; however, we believe that the above gives an intuitive way to interpret the similarity measure values.
As the contribution of the denominator is mostly to keep the values of T Z in the same range of values as zscores, one might desire to simplify the mathematical form of this equation, for example, by introducing a factor of 2 in front of the last element of the denominator (compare Equation 11 to Equation 13). Although prettier to write, this additional factor mitigates some of the properties of T Z described above. For example, instead of being bounded, there is now a singularity at In principle this is not an issue, but in practice, especially as we expect to find agreement between the networks derived from real data, and furthermore, the messagepassing procedure described below actually updates each network to make it more in agreement with the others, this additional factor of two also necessitates the addition of a small value, ε, to the equation to avoid computational anomalies: We have run PANDA (see below) using both T Z and T 2 (using ε =1e-10) and observe little difference between the results (see Supplemental Figure 2D). Therefore, to avoid the potential singularity, we prefer using T Z (Equation 11) but inform the reader that either T Z or T 2 will produce networks that are highly informative of a given biological system.

PANDA: Passing Attributes between Networks for Data Assimilation
After constructing initial networks for each data type, we pass messages between these networks in order to inform and update the values of their edges to represent more functionally relevant predictions. The messagepassing occurs in two steps: (1) estimate and update the regulatory network, and (2) estimate and update the data-specific (co-regulation and protein-cooperativity) networks. These updates are iteratively repeated until convergence.

Estimate and Update the Regulatory Network
Using Protein Interactions to Predict the Responsibility of Regulatory Relationships: When regulating a gene, transcription factor proteins rarely act alone. Instead, they often interact to form complexes that, in turn can bind to the control regions of genes and recruit transcriptional units such as RNA polymerase at a level unattainable by any individual portion of the complex. The structure of complexes allows transcription factors to influence the expression level of a gene in the absence of a physical binding site in the control region of that gene.
We assume our protein-protein interaction network represents cooperative regulation by TFs wherein the strength of an edge between two TFs is indicative of how often those two transcription factors work together to regulate the expression levels of genes. With this in mind, we heuristically combine the regulatory network (W) with the protein-cooperativity network (P) to predict the responsibility (R ij (t) ) of an edge from TF i to gene j in the regulatory network. Namely, since TFs that cooperatively regulate their targets share responsibility for the behavior of these genes, to determine the responsibility an individual TF has in regulating a particular target gene, we determine the level of agreement between the TFs that target gene j (W .j (t) ), and those that form a complex with TF i (P i.

(t)
): This formula was motivated by our desire to approximately maintain the resulting values to have the standard normal distribution. Intuitively, this is achieved by having the numerator in units of Z-scored-squared (rows and columns of P and W are approximately standardized, see Equations 7-9) and the denominator in units of Zscore. For example, if row i of P and column j of W are equal, then R ij will be equal to the magnitude of this row/column in P/W. The same intuitive approach is used for other terms, including the availability (Equation 15), the co-regulatory network (Equation 17) and the protein-cooperativity network (Equation 18).

Using Co-regulation to Predict the Availability of Regulatory Relationships:
Correlation in the expression profiles of two genes can be indicative of many types of relationships between those genes, including both direct (i.e. regulation of a gene by a transcription factor) and indirect relationships. We treat expression correlation as a "co-regulation" network, where values indicate the degree to which two genes are targeted by the same set of transcription factors. Under this assumption, we combine information in the regulatory network (W) with the co-regulatory network (C) to predict the availability (A ij ) of an edge between TF i and gene j in the regulatory network. Namely, since genes that are targeted by the same TF are co-regulated, at each iteration, t, to calculate A ij (t) we determine the level of agreement between the regulatory targets of TF i (Wi .

(t)
) and the set of genes with which gene j is co-regulated (C .j (t) ): Combining the Availability and Responsibility and Updating the Regulatory Network: Since regulation requires both that a TF is responsible for the regulatory status of its target gene and that the target gene is available to be regulated by that TF, we use the average of these two values ( ) and update the regulatory network by a small amount (α; 0<α<1): where the above equation represents the weighted average between the previous estimate for the regulatory network and the current estimate. The update parameter, α, can therefore be used to tune how quickly messages are being passed. We have found that in practice results are fairly stable for 0<α<0.2 (see Supplemental Figure 2B). In order to determine converge, at each iteration step we also calculate the hamming distance between the current and estimated network: Where N is the number of possible edges in the regulatory network, or the product of the number of genes times the number of TFs considered by the algorithm. In this work we define convergence when H (t) is less than 10 -5 .

Incorporating Further Data-types Into the Regulatory Network Estimate:
The above process can be expanded or modified to incorporate additional or different data-types that enhance our ability to estimate either a TF's responsibility in regulating a gene, or a gene's availability to be regulated by a TF. Additional data-types could be included by estimating additional values for either the availability or responsibility of the edges and merging these values together when updating W. Data-types besides the ones outlined here may also be used to estimate the availability and responsibility of the edges in the regulatory network. In these cases the exact formulation of equations above may need to change to better represent the information in these data types.

Estimate and Update the Co-regulatory and Protein-cooperativity Networks
We not only pass messages between TFs and their targets, but also incorporate information from the regulatory network into the information represented in the co-regulation and protein-cooperativity networks.

Using Regulatory Relationships to Predict Co-regulation:
The co-regulation network is initialized based on gene expression correlation. We admit, however, that the assumption that co-expression is equivalent to coregulation is a simplification. Since co-regulated genes are, by definition, targeted by the same TFs, we can improve the quality of our co-regulation network and estimate the weight of an edge between two genes, j and k, in the co-regulation network (C jk ) by comparing the set of TFs targeting gene j (W .j ) with the set of TFs targeting gene k (W .k ): Since a gene is, by definition, always co-regulated with itself, we use the self-co-regulation of a gene j (C jj ) to represent the amount of certainty we have in the edges surrounding that gene.
where N G is the number of genes queried by PANDA and σ j is the standard deviation across C .j . In principle, since we are dealing with Z-scores, σ j should be equal to one, however in practice we have found that it can vary enough to interfere with the message-passing. The value of ) ( t jj C increases as messages are being passed such that eventually the self-regulation will dominate the availability equation (Equation 15), , and the algorithm will converge around some estimate of edges.
Using Regulatory Relationships to Predict TF Interactions: While a large number of TF interactions have been identified in various experimental assays, only a small fraction occur in vivo and are functionally relevant, while still other TFs may not directly physically interact, but still may need to both be present for the activation of their target genes. Since TFs that target the same sets of genes are likely to co-operate together to regulate those genes, we can estimate the weight of an edge between two TFs, i and m, in the protein-cooperativity network (P im ) by comparing the set of genes regulated by TF i to those regulated by TF m: Since a TF, by definition, always regulates the same set of genes as itself, we use the self-cooperativity of a TF (P ii ) to represent the amount of certainty we have in the edges surrounding that TF.
N T is the number of TFs queried by PANDA and σ i is the standard deviation across P i. . In principle, since we are dealing with Z-scores, σ i should be equal to one, however in practice we have found, especially for smaller samples of TFs, that it can vary enough to interfere with the message-passing. The value of ) ( t ii P increases as messages are being passed such that eventually self-cooperation will dominate the responsibility equation , and the algorithm will converge around some estimate of edges.
Network Updates: This process gives estimates for the co-regulation and protein-cooperativity networks that are in agreement with what is known about the regulatory network. We use these values to update C and P: (20) The same value for α as was used to pass-messages to the regulatory network is also used to pass messages to these networks.

Network Validation
We evaluated the quality of the predicted networks using AUC-ROC (AUC for short) statistics. When running PANDA, the initial "seed" networks (Equations 1, 3 and 5 above) already have a quality greater than random. Therefore we evaluated the significance of the improvement in the AUC of the final predicted networks compared to the initial "seed" networks using a jackknife procedure in which we removed motif, interaction and expression data regarding a random 10% of TFs and 10% of genes and ran PANDA on the remaining data. We repeated this 100 times in order to generate a distribution of initial and final AUC scores for each network.
We determined the significance of the difference in these two distributions by taking the pair-wise difference between the AUCs of the final and initial networks. We assumed that these differences should follow a normal distribution, and likewise determined their mean and standard deviation. The significance of the difference between these values and that of no improvement (a difference value of zero) was calculated by taking the ratio of the mean over the standard deviation and determining the CDF of this value compared to a normal distribution with a mean of zero and standard deviation of one.

Identifying Condition-Specific Regulatory Information and modules:
We took the union of the top 1000 edges predicted by PANDA in each of the condition-specific regulatory networks to form an integrated genome-wide regulatory network for yeast (results of the analysis are similar if we take other edge cutoffs). We defined condition-specific subnetworks as the edges that appear uniquely in only one of these three edge sets. To analyze the functional properties of these condition-specific subnetworks, for each subnetwork, we took all genes or transcription factor connected any edge in the subnetwork. We performed functional analysis using the DAVID bioinformatics tool on these genes, using the 2,555 genes from our seed networks as a background.
Next, to determine specific gene-level condition-specific information within this unified network, we identified TFs and genes for which at least one-half of their predicted regulatory interactions in the unified network are identified in only one of the condition-specific subnetworks. Among these TFs and genes we selected those that have an overall connectivity of at least 10 or 3, respectively.
We wanted to visualize the regulatory information surrounding these genes and TFs. For each expression dataset, PANDA produces three predicted networks: co-regulatory, regulatory, and cooperativity. We identified condition-specific edges from the co-regulatory and cooperativity networks by thresholding based on the final edge-weight value of P and C and selecting the top 10% of edges. For consistency, we used the 1000 edges selected in the above analysis to represent the regulatory network. Finally, we build regulatory "modules" for each specific expression condition by taking the subset of these selected edges that extend between the condition-specific genes identified above. We then visualized these genes and edges to illustrate how PANDA uncovers regulatory information that can be very specific to the expression conditions used in building the model.

Running and Characterizing the Results of Other Network Reconstruction Algorithms
In addition to PANDA we ran and characterized the results of three other widely-cited network reconstruction algorithms: SEREND [13], ReMoDiscovery [14], CLR [15] and C3Net [16]. All four were implemented using default parameter settings.

CLR:
The CLR algorithm [15] was downloaded from the authors' website [17] and run using default parameters. We used the same gene expression data that we fed into the PANDA algorithm (a 2,555×N C matrix where N C is