A Bayesian network approach incorporating imputation of missing data enables exploratory analysis of complex causal biological relationships

Bayesian networks can be used to identify possible causal relationships between variables based on their conditional dependencies and independencies, which can be particularly useful in complex biological scenarios with many measured variables. Here we propose two improvements to an existing method for Bayesian network analysis, designed to increase the power to detect potential causal relationships between variables (potentially including a mixture of discrete and continuous variables). Our first improvement relates to the treatment of missing data. When there is missing data, the standard approach is to remove every individual with any missing data before performing analysis. This can be wasteful and undesirable when there are many individuals with missing data, perhaps each with only one or a few variables missing. This motivates the use of imputation. We present a new imputation method that uses a version of nearest neighbour imputation, whereby missing data from one individual is replaced with data from another individual, their nearest neighbour. For each individual with missing data, the subsets of variables used to select the nearest neighbour are chosen by sampling, without replacement, from the complete data and estimating a best fit Bayesian network. We show that this approach leads to marked improvements in the recall and precision of directed edges in the final network identified, and we illustrate the approach through application to data from a recent study investigating the causal relationship between methylation and gene expression in early inflammatory arthritis patients. We also describe a second improvement in the form of a pseudo-Bayesian approach for upweighting certain network edges, which can be useful when there is prior evidence concerning their directions.
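As a concrete illustration of the nearest neighbour idea (a minimal sketch, not the authors' implementation), the Python code below replaces each missing value with the corresponding value from the closest complete-data individual, measured over a supplied subset of variables. In the method described above that subset would be derived from a best fit Bayesian network estimated on a subsample of the complete data; here it is simply passed in by the caller, and a Euclidean distance over continuous variables is assumed.

```python
import numpy as np
import pandas as pd


def nn_impute(data: pd.DataFrame, predictors: dict) -> pd.DataFrame:
    """Illustrative nearest-neighbour imputation.

    `predictors[v]` is the subset of variables used to find the nearest
    neighbour when variable `v` is missing.  In the method described above
    this subset would come from a best-fit Bayesian network fitted to a
    subsample of the complete data; here it is an assumed input.
    """
    imputed = data.copy()
    donors = data.dropna()                         # individuals with complete data
    for i, row in data.iterrows():
        for v in row.index[row.isna()]:
            subset = [p for p in predictors.get(v, []) if not pd.isna(row[p])]
            if not subset or donors.empty:
                continue                           # nothing to match on; leave missing
            # Euclidean distance to every complete individual over the subset
            # (continuous variables assumed; discrete variables would need a
            # different distance measure)
            dists = np.sqrt(((donors[subset] - row[subset]) ** 2).sum(axis=1))
            nearest = dists.idxmin()               # label of the nearest neighbour
            imputed.loc[i, v] = donors.loc[nearest, v]
    return imputed
```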

The average network and the imputed data can also be calculated more quickly by running multiple instances of the serial version of BayesNetty and then combining the results. To calculate the average network, the data is bootstrapped many times and the best fit network fitted to each bootstrap sample; the resultant networks are then combined to give the average network. With enough processors, these bootstrap iterations can all be done in parallel. Similarly, for data imputation, as many processors as there are individuals with missing data can be used. To investigate the resulting timings, the data imputation and the average network were calculated using 1000 processors running the serial version of BayesNetty. The processes were run on a high performance computer with 2 Intel Xeon E5-2699 v4 processors (2.2 GHz) and 2.9 GB of memory per core.
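A minimal sketch of this "bootstrap, fit, and combine" scheme is shown below, with a generic structure-learning routine (`learn_edges`) standing in for a BayesNetty network search; the bootstrap iterations run in parallel via a Python process pool, and the result is the proportion of bootstrap networks containing each directed edge.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def _one_bootstrap(args):
    learn_edges, data, seed = args
    # Resample individuals with replacement and fit a best-fit network
    resampled = data.sample(n=len(data), replace=True, random_state=seed)
    return learn_edges(resampled)


def average_network(learn_edges, data: pd.DataFrame, n_boot=1000, workers=16):
    """Bootstrap the data, fit a network to each bootstrap sample in
    parallel, and return the proportion of bootstraps in which each
    directed (parent, child) edge appears.  `learn_edges` is any
    structure-learning routine returning a set of edges; in the paper this
    role is played by a BayesNetty best-fit network search."""
    jobs = [(learn_edges, data, seed) for seed in range(n_boot)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        edge_sets = list(pool.map(_one_bootstrap, jobs))
    counts = Counter(edge for edges in edge_sets for edge in edges)
    return {edge: count / n_boot for edge, count in counts.items()}
```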
The timings for BayesNetty to fit a best fit BN are given in the Additional Tables below. Fitting a BN with 31 nodes, which could be considered reasonably large, takes less than one second using the non-parallel version of BayesNetty. For larger networks the timings are also reasonable, taking just over 6 minutes for a network with 496 nodes. Earlier implementations, which naively recomputed the network score at each step, took over 6 minutes for a network of size 31, over an hour for 62 nodes, and were not feasible for 496 nodes. The times can be decreased further with the parallel version of BayesNetty, which uses the Open MPI parallel programming system. There is some computational overhead in using Open MPI, due to setup and to processors communicating with one another. The setup cost is shown by comparing the timings in the MPI(1) column with the serial timings. The parallel timings are much quicker, around 10 times quicker for the largest networks, although for the network with 31 nodes the parallel timings are actually a little slower due to this extra cost. There is little to be gained from using more than 16 processors, and doing so may even be slower for very large networks. The optimum number of processors in general depends on the data structure, data size and the computing system, so some experimentation may be needed.
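The saving over naive rescoring comes from score decomposability: the total network score is a sum of per-node terms, so a single-edge change only alters the term of the affected child node. The simplified sketch below (edge additions only, no acyclicity check, and an assumed `local_score` function such as a per-node BIC term) illustrates this caching idea; it is not the BayesNetty search itself.

```python
def hill_climb(nodes, local_score, max_iter=1000):
    """Greedy hill-climbing structure search with cached node scores.

    `local_score(node, parents)` returns the decomposable score
    contribution of one node given its parent set.  Because the total
    score is the sum of these terms, adding an edge only changes the term
    of the child node, so only that term needs recomputing rather than
    rescoring the whole network at every step.
    """
    parents = {n: frozenset() for n in nodes}
    cache = {n: local_score(n, parents[n]) for n in nodes}  # one score per node
    for _ in range(max_iter):
        best_gain, best_move = 0.0, None
        for child in nodes:
            for parent in nodes:
                if parent == child or parent in parents[child]:
                    continue
                new_parents = parents[child] | {parent}
                gain = local_score(child, new_parents) - cache[child]
                if gain > best_gain:
                    best_gain, best_move = gain, (child, new_parents)
        if best_move is None:
            break                                            # local optimum reached
        child, new_parents = best_move
        parents[child] = new_parents
        cache[child] = local_score(child, new_parents)       # update only this node
    return parents
```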
The timings to calculate an average network are also shown in the Additional Tables below; these show that by far the quickest approach is the simple parallel approach, which runs 1000 BayesNetty serial programs and collates the results. Using the parallel version of BayesNetty does speed up the computation, by speeding up the network search at each step, but not by as much as the simple parallel approach.
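For illustration, collating such runs amounts to counting how often each directed edge appears across the individual output files. The sketch below assumes a hypothetical output format of one "parent child" pair per line per run; BayesNetty's actual output format may differ.

```python
import glob
from collections import Counter


def collate_average_network(pattern="run*/edges.txt"):
    """Collate best-fit networks from many independent serial runs into an
    average network, assuming (hypothetically) one directed edge per line
    written as 'parent child' in each run's output file."""
    files = sorted(glob.glob(pattern))
    counts = Counter()
    for path in files:
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 2:
                    counts[(parts[0], parts[1])] += 1   # edge parent -> child
    n_runs = len(files)
    return {edge: c / n_runs for edge, c in counts.items()}
```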
The timings to impute data with BayesNetty are also shown below and show a similar pattern to the average network timings, for the same reasons. Again, it is not advantageous to use more than 16 processors. The imputation timings for the EM algorithm are much quicker than those for BayesNetty when comparing against the serial version, but, as imputation in BayesNetty can be run in parallel, it is possible to impute data more quickly with BayesNetty.
For most data applications, which typically consist of fewer than 50 nodes, there is probably no advantage in using the parallel version of BayesNetty to fit a best fit BN. For calculating an average network or imputing data, use of the simple "serial" parallel approach is recommended for larger networks; the BayesNetty website provides scripts to help do this. With this approach, and assuming there are enough separate processors, the timings for larger networks would be about the same as the time taken to fit one BN in serial.

Additional Tables for Text S2: Time in seconds to perform various calculations with the BayesNetty software. The standard non-parallel times are given in the Serial column. Parallel processor times using Open MPI are denoted by MPI, with the number of processors given in brackets (note that MPI(1) uses Open MPI but is not run in parallel). The Parallel 1000 column refers to 1000 BayesNetty serial programs running in parallel to calculate either the average network or the data imputation, with the results collated to give the final answer. Imputation with random training is denoted by RT and with complete training by CT.

Best Fit Network Search Timings in Seconds
[Table values not reproduced in this extract; columns: Nodes, Serial, MPI(1), MPI(8).]