Approximate Bayesian inference of directed acyclic graphs in biology with flexible priors on edge states

doi:10.1371/journal.pcbi.1014039

Fig 1.

Seven topologies used in simulation studies.

Orange edges have Markov equivalent counterparts, so their directions cannot be uniquely determined from data. (A) M1, a mediation model. (B) M2, a v-structure. The true probabilities of the edge states account for Markov equivalence: the values reflect that Markov equivalent graphs exist for M1 but not for M2. (C)–(G) Larger networks that contain M1 and M2 as subgraphs. See S1–S7 Figs for the true edge-state probabilities of the orange edges.
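
The contrast between the mediation model and the v-structure can be illustrated with a minimal sketch. It relies on the standard graphical criterion that two DAGs are Markov equivalent when they share the same skeleton and the same v-structures. The node names X, Y, Z and the helper functions are hypothetical and are not taken from the paper or from baycn.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG given as (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Colliders p -> child <- q in which p and q are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pa in parents.items():
        for p, q in combinations(sorted(pa), 2):
            if frozenset((p, q)) not in skel:
                found.add((p, child, q))
    return found

def markov_equivalent(g1, g2):
    """Same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# Hypothetical three-node graphs (assumed labels, for illustration only).
chain = [("X", "Y"), ("Y", "Z")]           # mediation: X -> Y -> Z
reversed_chain = [("Z", "Y"), ("Y", "X")]  # Z -> Y -> X
fork = [("Y", "X"), ("Y", "Z")]            # X <- Y -> Z
collider = [("X", "Y"), ("Z", "Y")]        # v-structure: X -> Y <- Z

print(markov_equivalent(chain, reversed_chain))  # True
print(markov_equivalent(chain, fork))            # True
print(markov_equivalent(chain, collider))        # False
```

In this sketch the chain shares its equivalence class with the reversed chain and the fork, whereas the collider stands alone, which is why edge directions in a v-structure are identifiable while those in a mediation chain are not. Any additional constraints on edge direction used in the paper are not modeled here.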

Fig 2.

An example of the output from baycn.

(A) The true graph GN4. The candidate graph used for inference also contains only these four edges. (B) Edge states and log pseudo-likelihood for the graph accepted at each iteration of the Metropolis-Hastings-like algorithm. (C) The proportion of each edge state in the sample, which estimates the pseudo-posterior probability of that edge state.
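
The estimate described in (C) amounts to counting how often each edge occupies each state across the retained iterations. The sketch below assumes a three-state coding (0 for one direction, 1 for the opposite direction, 2 for absence) and a made-up four-edge sample; the function name and the data are hypothetical and do not reproduce baycn output.

```python
from collections import Counter

def edge_state_proportions(samples, n_states=3):
    """Per-edge proportion of each state across sampled graphs.

    samples: list of tuples, one per retained iteration; entry j is the
    state of edge j (assumed coding: 0, 1 = two directions, 2 = absent).
    """
    n_iter = len(samples)
    n_edges = len(samples[0])
    proportions = []
    for j in range(n_edges):
        counts = Counter(sample[j] for sample in samples)
        proportions.append([counts.get(k, 0) / n_iter for k in range(n_states)])
    return proportions

# Made-up sample of four accepted graphs over four edges.
samples = [
    (0, 0, 2, 1),
    (0, 1, 2, 1),
    (0, 0, 2, 0),
    (0, 0, 0, 1),
]
props = edge_state_proportions(samples)
print(props[1])  # [0.75, 0.25, 0.0]
```

With a real run the sample would contain many more iterations, typically after discarding burn-in, and each row of `props` would sum to one.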

Table 1.

Summary of our method and other Bayesian methods for network inference.

Table 2.

Mean runtime in seconds across 25 datasets. For each topology, 25 datasets were generated. Each algorithm was run once per dataset, and the runtime in seconds was recorded. All methods were run on an Intel Xeon D-1540 (2.00 GHz processor, 128 GB of RAM).

Fig 3.

Precision, power, and MSE₂ (on the posterior probabilistic adjacency matrix) for baycn and other Bayesian methods on simulated data with varying sample sizes and signal strengths from GN4, GN11 and GN8.

A fully connected graph was used as the input to each method. For each method, we considered nine scenarios (all combinations of three sample sizes and three signal strengths) and simulated 25 independent datasets in each scenario. After applying the methods, we calculated the mean and standard deviation of each metric in each scenario. Each method therefore has nine dots in each plot, all shown in one color. Each dot is the mean of a metric, and the whiskers on either side of the dot span one standard deviation of that metric. MSE₂ was not calculated for scanBMA, since the posterior probabilities from scanBMA have a different interpretation from those of the MCMC methods (see section “Relationship to existing Bayesian methods”).
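
The summary step described above can be sketched as follows. The metric definitions here are generic assumptions, not the paper's exact formulas: an edge is called present when its posterior probability exceeds a hypothetical cutoff of 0.5, precision and power are computed against a binary true adjacency matrix, and the MSE is the mean squared difference between true and posterior adjacency entries. All names and numbers are illustrative.

```python
from statistics import mean, stdev

def precision_power(true_adj, post_adj, cutoff=0.5):
    """Precision and power after thresholding posterior probabilities (assumed cutoff)."""
    tp = fp = fn = 0
    for t_row, p_row in zip(true_adj, post_adj):
        for t, p in zip(t_row, p_row):
            called = p > cutoff
            if called and t == 1:
                tp += 1
            elif called and t == 0:
                fp += 1
            elif not called and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    power = tp / (tp + fn) if tp + fn else 0.0
    return precision, power

def mse(true_adj, post_adj):
    """Mean squared error over all adjacency entries (generic definition)."""
    return mean((t - p) ** 2
                for t_row, p_row in zip(true_adj, post_adj)
                for t, p in zip(t_row, p_row))

def summarize(values):
    """Mean and standard deviation of one metric across datasets in a scenario."""
    return mean(values), stdev(values)

# Made-up three-node example.
true_adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
post_adj = [[0, 0.9, 0.6], [0.1, 0, 0.4], [0, 0, 0]]
precision, power = precision_power(true_adj, post_adj)
error = mse(true_adj, post_adj)

# Made-up precision values from three datasets in one scenario.
scenario_mean, scenario_sd = summarize([0.5, 0.7, 0.9])
```

In the figure, each dot would correspond to `scenario_mean` and each whisker to `scenario_sd`, computed over the 25 datasets of a scenario rather than the three values used here.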

Fig 4.

Inference of the GEUVADIS eQTL-gene set Q8.

This eQTL is associated with three genes. (A)–(E) Graphs inferred by different methods, with posterior probabilities shown only for edges of biological interest. The three probabilities are for the displayed direction, the opposite direction, and edge absence, respectively. 0* indicates that the corresponding direction was blacklisted during inference. BCDAG in (D) does not support edge blacklisting. scanBMA in (E) cannot infer the probability of edge absence (hence the NAs) or distinguish the two directions. (F) Heatmap of Pearson correlations in the data. The posterior probabilities of all edges from the different methods are given in S7–S11 Tables.

Fig 5.

Combinatorial binding of transcription factors (TFs) in five tissue types from the benchmark dataset on Drosophila embryo [54].

The TFs are: Twist (Twi), Tinman (Tin), Myocyte enhancer factor 2 (Mef2), Bagpipe (Bap), and Biniou (Bin). The tissue types are: mesoderm (Meso), mesoderm and somatic muscle (Meso&SM), visceral muscle (VM), visceral muscle and somatic muscle (VM&SM), and somatic muscle (SM). (A) The heatmap of Pearson correlations following hierarchical clustering. (B) The graph inferred by baycn. To avoid confusion when interpreting directed edges in time-series data, we use a dot in place of an arrowhead. Except for bidirected edges, only the inferred direction is shown, with its posterior probability. Posterior probabilities were averaged over three independent runs of 5 million iterations. Shades of the TFs reflect the timing of TF binding: later time points correspond to darker shades. (C) Computational feasibility of each method with different input graphs. Total runtime (in minutes, on CPUs of a shared computing cluster) is reported for the four sampling-based methods over 5 million iterations. Runtime for baycn is the average of three independent runs. (D) Known relationships between tissues and TFs, with the corresponding presence/absence inference from each method. A relationship is considered present if a method infers at least one edge between the associated tissue and the given TF at any of its time points.
