Self-Correcting Maps of Molecular Pathways

Reliable and comprehensive maps of molecular pathways are indispensable for guiding complex biomedical experiments. Such maps are typically assembled from myriads of disparate research reports and are replete with inconsistencies due to variations in experimental conditions and/or errors. It is often an intractable task to manually verify internal consistency over a large collection of experimental statements. To automate large-scale reconciliation efforts, we propose a random-arcs-and-nodes model where both nodes (tissue-specific states of biological molecules) and arcs (interactions between them) are represented with random variables. We show how to obtain a non-contradictory model of a molecular network by computing the joint distribution for arc and node variables, and then apply our methodology to a realistic network, generating a set of experimentally testable hypotheses. This network, derived from an automated analysis of over 3,000 full-text research articles, includes genes that have been hypothetically linked to four neurological disorders: Alzheimer's disease, autism, bipolar disorder, and schizophrenia. We estimated that approximately 10% of the published molecular interactions are logically incompatible. Our approach can be directly applied to an array of diverse problems including those encountered in molecular biology, ecology, economics, politics, and sociology.


Supplement A: arcs update
In this section, we consider different schemes of conditional probabilities for updating one arc, A_ij, given its adjacent nodes, V_i and V_j.

Arcs update version A
(1) (2) where the symbol * is a "wildcard" character representing either 0 or 1, and P_0(·) represents the prior probability of a given event. This setup corresponds to the assumption that the states of the arc-adjacent nodes are insufficient to increase the certainty that the nodes interact. We do not alter the probability that the connection between two nodes has no effect, but we can informatively redistribute probability between activation and inhibition.
An assumption common across versions is that if the upstream node V_i of an arc A_ij is inactive (V_i = 0), the value of A_ij is sampled from the prior distribution for the arc, P(A_ij = a_ij). An inactive upstream node (V_i) provides no evidence about its effect on the downstream node (V_j).

Arcs update version B
(4) This version corresponds to the assumption that the states of the arc-adjacent nodes can be used as direct evidence in support of the existence of an interaction.

Arcs update version C
Consider the conditional probability
To simplify the computation, we assume that A_ij is independent of the incoming node V_i, and we consider only inputs from V_i and A_ij when deciding the output at V_j. Then, from (7), the conditional probabilities for updating A_ij are derived as follows:
if a = 0. (9)
This update uses a Bayesian probability calculation and allows for a more intuitive redistribution of probabilities. It can also be readily extended to more sophisticated updating schemes by incorporating more complex models for multiple inputs at V_j.
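The Bayesian step described above can be sketched as follows. This is a minimal illustration, not the authors' exact equations: the function name `arc_update_bayes`, the `likelihood` callable, and the numbers in the example are all assumptions introduced here. The sketch applies Bayes' rule with the arc prior (using the stated independence of A_ij and V_i) and falls back to the prior when the upstream node is inactive.

```python
ARC_STATES = (-1, 0, 1)  # inhibition, no effect, activation


def arc_update_bayes(prior, likelihood, v_i, v_j):
    """Return P(A_ij = a | V_i = v_i, V_j = v_j) for every arc state a.

    prior:      dict mapping arc state a -> P_0(A_ij = a)
    likelihood: hypothetical callable (v_j, a) -> P(V_j = v_j | V_i = 1, A_ij = a)
    """
    if v_i == 0:
        # An inactive upstream node carries no evidence: keep the prior.
        return dict(prior)
    weights = {a: likelihood(v_j, a) * prior[a] for a in ARC_STATES}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observation has zero probability under every arc state")
    return {a: w / total for a, w in weights.items()}


def example_likelihood(v_j, a):
    # Assumed illustrative model: activation makes V_j = 1 likely,
    # inhibition makes it unlikely, no effect leaves it at 0.5.
    p_active = {1: 0.9, 0: 0.5, -1: 0.1}[a]
    return p_active if v_j == 1 else 1.0 - p_active


prior = {-1: 0.2, 0: 0.5, 1: 0.3}
posterior = arc_update_bayes(prior, example_likelihood, v_i=1, v_j=1)
```

With these assumed numbers, observing both nodes active raises the activation probability from 0.3 to 0.5 and lowers the inhibition probability, while the three posterior values still sum to one.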

Arcs update version D
A slightly modified version.
This arc update version differs in that when there is an input from V_i (i.e., V_i = 1), the arc action likelihood is updated. The alteration maintains the ratio of the probabilities of activation and inhibition, and down-weights the probability of no effect by an artificial factor P_0(V_i = 1).
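The down-weighting described for version D can be illustrated with a small helper. This is only a sketch under stated assumptions: the name `downweight_no_effect` is hypothetical, and which arc distribution the factor is applied to is not specified in the text above, so the example simply uses an arbitrary distribution. The helper multiplies the no-effect probability by a factor and renormalizes, which leaves the activation-to-inhibition ratio unchanged.

```python
def downweight_no_effect(dist, factor):
    """Scale P(A = 0) by `factor`, then renormalize.

    dist:   dict mapping arc state (-1, 0, 1) -> probability
    factor: number in [0, 1], e.g. P_0(V_i = 1) in version D
    """
    scaled = dict(dist)
    scaled[0] = dist[0] * factor
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("all probability mass was removed")
    # Dividing every entry by the same total preserves the ratio
    # between the activation and inhibition probabilities.
    return {a: p / total for a, p in scaled.items()}


arc_dist = {-1: 0.2, 0: 0.5, 1: 0.3}   # assumed illustrative values
updated = downweight_no_effect(arc_dist, factor=0.4)
```

In this example the no-effect probability drops from 0.5 to about 0.29, while activation remains 1.5 times as probable as inhibition.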

Arcs update version E
One last proposal: a generalization of version D.
Here the probability of no effect is down-weighted by a factor βP_0(V_i = 1), where β is an arbitrary constant (0 ≤ β ≤ 1). When β = 0, this equation is equivalent to version C. When β = 1, it is equivalent to version D.

Multiple arcs connected to the same node
The situation becomes more complicated when multiple arcs are adjacent to one node. One possible way to proceed is to update each arc independently (as outlined above). We favor the simplicity of this solution. The one-by-one idea behind our "independent arc update" accords with the way in which most interactions are discovered and published in the literature 1 . Since these interactions are independently derived, they can be tested for consistency with node-related data in the same fashion.
An alternative solution would be to update the values of all arcs adjacent to the same node simultaneously. We do not address this in detail in this article because it is difficult to find a set of assumptions that can be plausibly justified by biological data. Another complication of multiple-arc updating is that the given node may have parental nodes that are not included in the data. Such a complicated update would add computational burden without yielding good probability estimates.

Supplement B: updating nodes given fixed arcs
In this section we describe different options for updating a node given fixed incoming arcs. Our topology is a directed graph to incorporate causal knowledge 2 . This allows us to calculate the joint distribution for all nodes by decomposing the computation into the product of the conditional probabilities of nodes, their parental nodes and associated arcs.
We begin with the input nodes (nodes with no parent nodes) and follow the graph until all sink nodes (nodes with no offspring nodes) are updated.
Node G is the only input node, and node D is the only sink node. The joint distribution of node values given arcs for this simple graph has the following form:
Therefore, to draw a sample from the full joint distribution of nodes given arcs, we first update the value of the input node (G) by sampling from its prior distribution. For all non-input nodes (B, C, and D), the value is sampled from the appropriate conditional distribution (given the values of the arcs and parental nodes of the node being updated). Each child node can be updated only after all of its parental nodes have been updated.
A more general case of node updating concerns sampling values for a node, V_i, that has multiple incoming edges with possibly conflicting arc values (both inhibition and activation). We need to define the conditional probability that V_i = v_i given the values of upstream nodes and arcs, assuming that node V_i has n parents, with values {V_j = v_j}, j = 1,...,n, at the parental nodes and values {A_ij = a_ij}, j = 1,...,n, at the direct incoming arcs of V_i. Different definitions of this conditional probability represent different assumptions and models for the local network around V_i.
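The parents-before-children update order can be sketched with a topological sort. The edges of the example graph are not given in the text, so the diamond G→B, G→C, B→D, C→D used here is an assumption, as are the function names and the placeholder child rule; only the ordering logic is the point of the sketch.

```python
import random
from graphlib import TopologicalSorter  # Python 3.9+


def sample_nodes(parents, sample_input, sample_child, rng):
    """Sample every node once, parents before children.

    parents: dict mapping node -> list of its parental nodes
    """
    values = {}
    # static_order() yields each node only after all of its predecessors.
    for node in TopologicalSorter(parents).static_order():
        if not parents[node]:
            values[node] = sample_input(node, rng)      # input node: use prior
        else:
            parent_values = {p: values[p] for p in parents[node]}
            values[node] = sample_child(node, parent_values, rng)
    return values


# Assumed topology with G as the only input and D as the only sink.
parents = {"G": [], "B": ["G"], "C": ["G"], "D": ["B", "C"]}


def sample_input(node, rng):
    return int(rng.random() < 0.5)          # assumed prior P_0(V = 1) = 0.5


def sample_child(node, parent_values, rng):
    # Placeholder rule for illustration only: active if any parent is active.
    return int(any(parent_values.values()))


values = sample_nodes(parents, sample_input, sample_child, random.Random(0))
```

Because Python dictionaries preserve insertion order, `list(values)` shows the order in which the nodes were updated; in a real model `sample_child` would be replaced by one of the node update versions.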

Node update version A
2 This first version is a directed acyclic graph; we hope to handle cycles in the next version.
Here I_i^+ = Σ_{j: a_ij = 1} v_j and I_i^- = Σ_{j: a_ij = −1} v_j are the combined incoming activating and inhibiting signals upon V_i, and φ(I_i^+, I_i^-) is a function that resolves a conflict between activating and inhibiting signals. In the simplest case it can be a single probability parameter common to the whole network. In a more sophisticated setup, the value of the parameter can vary across nodes. In an even more complex case, function φ(I_i^+

Node update version B
This update version resolves conflicting inputs by assuming that every conflict in input signals returns us to the prior distribution for the node.
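The difference between the conflict-resolution rules of node update versions A and B can be sketched as follows. The function names are hypothetical, and the behaviour outside a conflict (pure activation, pure inhibition, or no signal at all) is an assumption made here for illustration, since the corresponding equations are not reproduced above.

```python
def incoming_signals(parent_values, arc_values):
    """Sum active parents over activating arcs (+1) and inhibiting arcs (-1)."""
    i_plus = sum(v for v, a in zip(parent_values, arc_values) if a == 1)
    i_minus = sum(v for v, a in zip(parent_values, arc_values) if a == -1)
    return i_plus, i_minus


def p_active(parent_values, arc_values, prior, version, phi=0.5):
    """Probability that the node is active under version 'A' or 'B'."""
    i_plus, i_minus = incoming_signals(parent_values, arc_values)
    if i_plus > 0 and i_minus > 0:
        # Conflict: version A uses the parameter phi, version B the node prior.
        return phi if version == "A" else prior
    if i_plus > 0:
        return 1.0      # assumption: unopposed activation switches the node on
    if i_minus > 0:
        return 0.0      # assumption: unopposed inhibition switches the node off
    return prior        # assumption: without any signal, keep the prior


conflict_a = p_active([1, 1], [1, -1], prior=0.3, version="A")
conflict_b = p_active([1, 1], [1, -1], prior=0.3, version="B")
```

With one active activating parent and one active inhibiting parent, version A returns φ = 0.5 and version B returns the node prior of 0.3.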

Node update version C
In this version, we assume that the state of node V_i is determined by a Boolean function over the values of V_j ∈ par(V_i) and the values of the corresponding edges, A_ij. For example, if vertex V_1 has just two parental nodes, V_2 and V_3, as shown below:
we can assume that node V_1 is active only if both parental nodes are active, V_2 = V_3 = 1, and both arcs are in the activation state, A_21 = A_31 = 1. Similarly, we can assume that to inhibit V_1 we need both incoming edges in the inhibition state, A_21 = A_31 = −1, and both parental nodes active. Then the probability of finding node V_1 activated given the states of the parental nodes and arcs is as follows.
(23) Given n inputs to a node V_i, we can define the state of the node as a probabilistic function of all possible 2^n × 3^n states of the parental nodes and arcs. Unfortunately, at the present state of knowledge about biological circuitry, it would be very hard to find biological data to estimate the parameters of such a model.
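The two-parent Boolean rule and the size of the input state space can be sketched as below. The function names are hypothetical, and returning the node prior for all remaining input combinations is an assumption made here, since equation (23) is not reproduced above.

```python
from itertools import product


def p_v1_active(v2, v3, a21, a31, prior):
    """Boolean AND rule for a node V_1 with parents V_2 and V_3."""
    both_parents_active = (v2 == 1 and v3 == 1)
    if both_parents_active and a21 == 1 and a31 == 1:
        return 1.0      # both arcs activate: V_1 is switched on
    if both_parents_active and a21 == -1 and a31 == -1:
        return 0.0      # both arcs inhibit: V_1 is switched off
    return prior        # assumption: otherwise fall back to the prior


def count_input_states(n):
    """Enumerate joint states of n binary parents and n three-valued arcs."""
    node_states = product((0, 1), repeat=n)
    arc_states = product((-1, 0, 1), repeat=n)
    return sum(1 for _ in product(node_states, arc_states))


n_states_two_parents = count_input_states(2)
```

For two parents the enumeration yields 2^2 × 3^2 = 36 input states, which illustrates how quickly a fully general probabilistic function would need more parameters than the available data can support.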

Supplement C: enumeration examples
In this section, we give numerical illustrations of different arc update and node update versions. For comparison, in addition to the arc update versions considered in Supplement A, we also include examples where the arcs are assumed to be independent of the nodes.

Independent arcs
Consider a special case in which the distributions of the arcs are fixed regardless of the values of the adjacent nodes, that is, P(A_ij = a_ij | V_i = v_i, V_j = v_j) = P_0(A_ij = a_ij). Then, we can compute the marginal probabilities for nodes through direct summation over the appropriate joint distribution for nodes and arcs. Below we illustrate computation of the joint and marginal distributions for a simple linear network example.
Enumerated marginals for the nodes can be easily calculated:
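A minimal sketch of such an enumeration is given here for a three-node chain V_1 → V_2 → V_3 with independent arcs. The chain length, the prior values, and the single-parent node rule in `p_child` are assumptions chosen for illustration; they are not the numbers of the original example.

```python
from itertools import product

ARC_STATES = (-1, 0, 1)


def p_child(v_child, v_parent, arc, prior_child):
    """P(child = v_child | parent, arc) under an assumed single-parent rule."""
    if v_parent == 1 and arc == 1:
        p_on = 1.0              # active parent with an activating arc
    elif v_parent == 1 and arc == -1:
        p_on = 0.0              # active parent with an inhibiting arc
    else:
        p_on = prior_child      # no signal: use the child's prior
    return p_on if v_child == 1 else 1.0 - p_on


def chain_marginals(node_priors, arc_priors):
    """Exact P(V_k = 1) for a linear chain, by summing the full joint."""
    n = len(node_priors)
    marginals = [0.0] * n
    for nodes in product((0, 1), repeat=n):
        for arcs in product(ARC_STATES, repeat=n - 1):
            p = node_priors[0] if nodes[0] == 1 else 1.0 - node_priors[0]
            for k, a in enumerate(arcs):
                p *= arc_priors[k][a]       # arcs are independent of the nodes
                p *= p_child(nodes[k + 1], nodes[k], a, node_priors[k + 1])
            for k, v in enumerate(nodes):
                if v == 1:
                    marginals[k] += p
    return marginals


arc_prior = {-1: 0.2, 0: 0.5, 1: 0.3}
marginals = chain_marginals([0.5, 0.5, 0.5], [arc_prior, arc_prior])
```

With these assumed priors the marginals are 0.5 for V_1, 0.525 for V_2, and 0.52625 for V_3 (up to floating-point rounding): the excess of activation over inhibition in the arc prior pushes the downstream nodes slightly above their priors.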

X-shaped example
In this section, we use a more general example to illustrate different arc and node update versions, starting with simpler computations where independent arcs are assumed.

Independent arcs
Assuming independent arcs, the probabilities can be enumerated as follows.
We implemented this example with two of our node update versions. Table 1 shows the results of using node update version A with φ(I_i^+, I_i^-) = 1/2 (Equation (18)). The enumerated marginals on the nodes (obtained by exact probability computation) are compared with both the priors and the values calculated using the proposed numerical updates with the Gibbs sampler. Since we assume independent arcs for this computation, the marginals on the arcs are unchanged from the prior values. The resulting network is compared with the network based on prior values in Figure 1.

[Table 1. Columns: Node, Priors, Exact marginals, Gibbs sampler.]

For comparison, we also implemented the enumeration and numerical updating using node update version B. The results are listed in Table 2.

[Table 2. Columns: Node, Priors, Exact marginals, Gibbs sampler.]

These examples also illustrate the difference between node update versions A and B. In the case of conflicting inputs for node C, version A assumes a complete lack of information about the state of V_C (φ(I_i^+, I_i^-) = 1/2, Equation (18)), while version B samples values from the prior distribution for node V_C.

Updating both nodes and arcs
In this section, we apply different combinations of arc and node update versions to the same numerical X-shaped example, with exactly the same prior values.
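The alternation between arc updates and node updates can be sketched as a two-block Gibbs sampler on the smallest possible network, V_1 → V_2 with a single arc. Everything specific here is an assumption for illustration: the network, the node rule in `p_child_active`, the Bayesian arc update, the prior values, and the chain settings are not those of the X-shaped example.

```python
import random

ARC_STATES = (-1, 0, 1)


def p_child_active(v_parent, arc, prior_child):
    """Assumed node rule: an active parent acts through the arc, else prior."""
    if v_parent == 1 and arc == 1:
        return 1.0
    if v_parent == 1 and arc == -1:
        return 0.0
    return prior_child


def sample_arc(v1, v2, arc_prior, prior_v2, rng):
    """Draw the arc given both nodes (prior times node likelihood)."""
    weights = []
    for a in ARC_STATES:
        p_on = p_child_active(v1, a, prior_v2)
        weights.append(arc_prior[a] * (p_on if v2 == 1 else 1.0 - p_on))
    return rng.choices(ARC_STATES, weights=weights)[0]


def gibbs_marginals(prior_v1, prior_v2, arc_prior, chains, iterations, seed=0):
    """Estimate P(V_2 = 1) and the arc marginals from many short chains."""
    rng = random.Random(seed)
    v2_hits = 0
    arc_hits = {a: 0 for a in ARC_STATES}
    for _ in range(chains):
        # Random starting point for this chain.
        v1, v2 = rng.randint(0, 1), rng.randint(0, 1)
        arc = rng.choice(ARC_STATES)
        for _ in range(iterations):
            arc = sample_arc(v1, v2, arc_prior, prior_v2, rng)   # arcs | nodes
            v1 = int(rng.random() < prior_v1)                    # input node
            v2 = int(rng.random() < p_child_active(v1, arc, prior_v2))
        v2_hits += v2
        arc_hits[arc] += 1
    return v2_hits / chains, {a: c / chains for a, c in arc_hits.items()}


p_v2, arc_marginals = gibbs_marginals(
    0.5, 0.5, {-1: 0.2, 0: 0.5, 1: 0.3}, chains=4000, iterations=20
)
```

Under these assumptions the exact value of P(V_2 = 1) is 0.5 × 0.5 + 0.5 × (0.3 + 0.5 × 0.5) = 0.525, so the sampled estimate should land close to that value, which mirrors the comparison of enumerated and sampled marginals reported in the tables.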

Node update   Arc update                 Numerical results    Graph
Version A     Version C                  Tables 4 and 5       Figure 2
Version A     Version D                  Tables 6 and 7       Figure 3
Version B     Version B                  Tables 8 and 9       Figure 4
Version B     Version C                  Tables 10 and 11     Figure 5
Version B     Version E (with β = 0.2)   Tables 12 and 13     Figure 6

Table 3: Versions of update depicted in the supplement.

[Table columns: Node, Priors, Marginals.]

Table 13: Marginals for the arcs estimated with the Gibbs sampler: node update model B, arc update model E (with β = 0.2). (Gibbs sampler: 10,000 different starting points, 100 iterations for each chain.)

Figure 6: Update of both arcs and nodes of a hypothetical five-node and four-arc network (node update version B, arc update version E with β = 0.2). Detailed numerical information corresponding to this figure is shown in Tables 12 and 13.