The authors have declared that no competing interests exist.

Conceived and designed the experiments: RN. Performed the experiments: RN. Analyzed the data: RN MS. Contributed reagents/materials/analysis tools: RN MS. Wrote the paper: RN MS.

Molecular entities work in concert as a system and mediate phenotypic outcomes and disease states. There has been recent interest in modelling the associations between molecular entities as networks, inferred from their observed expression profiles using a battery of algorithms. These networks have proven to be useful abstractions of the underlying pathways and signalling mechanisms. Noise is ubiquitous in molecular data and can have a pronounced effect on the inferred network. Noise can arise from several factors, including inherent stochastic mechanisms at the molecular level, variation in the abundance of molecules, heterogeneity, sensitivity of the biological assay, and measurement artefacts prevalent especially in high-throughput settings. The present study investigates the impact of discrepancies in noise variance on pair-wise dependencies, conditional dependencies and constraint-based Bayesian network structure learning algorithms that incorporate conditional independence tests as a part of the learning process. Popular network motifs and fundamental connections, namely: (

Identifying associations and network structures from observational data sets obtained across a given set of entities is a challenging problem and of great interest across a spectrum of disciplines including molecular biology

Molecular data obtained from biological systems may or may not have explicit temporal information. While the former explicitly captures the evolution of the molecular activity as a function of time (

It is worth noting that these molecular data sets are inherently noisy

Prior to investigating the impact of noise on the constraint-based Bayesian network structure learning algorithms, its impact on pairwise and conditional dependencies across the three network motifs is investigated.
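As a minimal numerical illustration of the pairwise and conditional dependencies considered here, the following Python sketch simulates a linear common-effect motif (x → z ← y) and computes the correlations and first-order partial correlations. The model form, the coefficients `a` and `b`, the noise standard deviations, the sample size and all function names are assumptions introduced for illustration; they are not taken from the expressions in the text.

```python
import numpy as np

def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b given c."""
    return (r_ab - r_ac * r_bc) / np.sqrt((1.0 - r_ac**2) * (1.0 - r_bc**2))

def simulate_common_effect(n, noise_sd, a=1.0, b=1.0, seed=0):
    """Hypothetical linear common-effect motif: z = a*x + b*y + noise.

    noise_sd holds the (assumed) noise standard deviations at x, y and z.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, noise_sd[0], n)
    y = rng.normal(0.0, noise_sd[1], n)
    z = a * x + b * y + rng.normal(0.0, noise_sd[2], n)
    return x, y, z

def summarize(x, y, z):
    """Sample correlations and partial correlations for the three nodes."""
    r = np.corrcoef([x, y, z])
    r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
    return {
        "r_xy": r_xy,
        "r_xz": r_xz,
        "r_yz": r_yz,
        "r_xy.z": partial_corr(r_xy, r_xz, r_yz),
        "r_xz.y": partial_corr(r_xz, r_xy, r_yz),
        "r_yz.x": partial_corr(r_yz, r_xy, r_xz),
    }

# Equal noise variances versus a large noise variance at z (assumed values).
balanced = summarize(*simulate_common_effect(2000, (1.0, 1.0, 1.0)))
noisy_z = summarize(*simulate_common_effect(2000, (1.0, 1.0, 20.0)))

for name, result in (("balanced", balanced), ("noisy z", noisy_z)):
    print(name, {k: round(float(v), 3) for k, v in result.items()})
```

Under these assumed settings, the population partial correlation between x and y given z is -0.5 when all noise variances are equal, and it approaches zero as the noise variance at z grows, so the dependence induced by conditioning on the common effect becomes hard to detect.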

In the following discussion,

The common-effect network motif (

The correlation coefficients are given by

The partial correlations are given by

In the large-noise limit at z

The partial correlations are given by

In the large-noise limit at y

The partial correlations are given by

Consider the three-chain network motif

The correlation coefficients are given by

The partial correlations are given by

For large noise limit at

The partial correlations are given by

For large noise limit at

The partial correlations are given by

Consider the coherent type-I feed-forward loop

The correlation coefficients are given by

The partial correlations are given by

For large noise limit at

The partial correlations are given by

For large noise limit at

The partial correlations are given by

Common-Effect | 0 | 0 | 1 | 1 | ||

Three-Chain | 0 | 0 | 1 | 0 | 0 | 1 |

Type I FFL | 0 | 0 | 1 | 1 | ||

0 | 0 | 0 | 0 | 0 | 0 | |

0 | 0 | 0 | 0 | |||

0 | 0 | 0 | 0 |

Bayesian network structure learning algorithms have been used successfully to infer the associations between a large number of variables. Several such algorithms have been proposed in the literature; a partial list of contributions includes

GS was the first algorithm that learned the

For relatively large noise variance at

For relatively large noise variance at

For IAMB, the conditional independence tests are performed in a different order since the nodes are included in the Markov blankets in decreasing order of association. However, the resulting Markov blankets

For relatively larger noise variance at

As in the case of the common-effect network motif, reordering of the conditional independence tests in IAMB does not result in Markov blankets different from those inferred by GS. Unlike the common-effect motif, no asymmetry between the Markov blankets is observed for the three-chain, since

For relatively large noise variance at

In this case, no asymmetry is observed despite the effects of noise. Nevertheless, neither GS nor IAMB was able to learn the motif for relatively large noise variance.
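The effect of noise on Markov blanket discovery can be explored with a small sketch. The code below is a simplified grow-shrink Markov blanket search, not the implementation used in the study: it assumes Gaussian data, uses Fisher's z test on partial correlations as the conditional independence test, and simulates a hypothetical linear three-chain (x → y → z). The coefficients, noise standard deviations, significance level and function names are all illustrative assumptions.

```python
import math
import numpy as np

def fisher_z_independent(data, i, j, cond, alpha=0.05):
    """True if columns i and j look independent given the columns in cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.pinv(np.corrcoef(data[:, idx], rowvar=False))
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    r = min(max(r, -0.999999), 0.999999)                  # guard the log below
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))          # two-sided normal
    return p_value > alpha

def grow_shrink_blanket(data, target, alpha=0.05):
    """Simplified grow-shrink estimate of the Markov blanket of one column."""
    others = [v for v in range(data.shape[1]) if v != target]
    blanket = []
    changed = True
    while changed:  # grow: add variables dependent on target given the blanket
        changed = False
        for v in others:
            if v not in blanket and not fisher_z_independent(
                data, target, v, blanket, alpha
            ):
                blanket.append(v)
                changed = True
    for v in list(blanket):  # shrink: drop variables made redundant by the rest
        rest = [u for u in blanket if u != v]
        if fisher_z_independent(data, target, v, rest, alpha):
            blanket.remove(v)
    return sorted(blanket)

def simulate_three_chain(n, noise_sd, a=1.0, b=1.0, seed=0):
    """Hypothetical linear three-chain: y = a*x + noise, z = b*y + noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, noise_sd[0], n)
    y = a * x + rng.normal(0.0, noise_sd[1], n)
    z = b * y + rng.normal(0.0, noise_sd[2], n)
    return np.column_stack([x, y, z])  # columns 0, 1, 2 are x, y, z

balanced = simulate_three_chain(2000, (1.0, 1.0, 1.0))
noisy_y = simulate_three_chain(2000, (1.0, 200.0, 1.0))

for name, data in (("balanced", balanced), ("noisy y", noisy_y)):
    print(name, {t: grow_shrink_blanket(data, t) for t in range(3)})
```

With equal noise variances the blanket of y is expected to contain both x and z. In this sketch, a very large noise variance at y drives the correlation between x and y close to zero, so x typically drops out of the blanket of y and the chain is no longer recovered.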

For relatively large noise variance at

In the present case, discrepancy in noise variance does not result in asymmetry in the Markov blankets. Thus, symmetry correction may not alleviate the impact of noise. Possible motif structures corresponding to large discrepancies at

For relatively large noise variance

Asymmetry between the Markov blankets is observed across

In the following discussion, the three gene network motifs are generated using (1, 8, 15) with parameter

Results generated using the constraint-based structure learning algorithms GS and IAMB were quite similar, consistent with their expected behaviour (Section 2.2). Therefore, we discuss only the results from the GS algorithm. The networks were learned across 200 independent realizations of the data (sample size = 2000) and Friedman's

The common-effect network motif,

The

The confidences of the edges

The three-chain network motif,

The coherent Type-I feed-forward loop network motif,

In a recent study

Three different networks

The networks

Confidences estimated from 200 independent bootstrap realizations are shown along the edges. Edges with

In order to minimize the impact of skewness on the conclusions, the entire exercise was repeated on the log-transformed data. The resulting networks along with confidence of the edges are shown in

Real-world entities work in concert as a system and not in isolation. Associations between such entities are usually unknown. Inferring associations and network structure from data obtained across the entities is of great interest across a number of disciplines. The recent surge of high-throughput molecular assays in conjunction with a battery of algorithms has facilitated validating established associations while discovering new ones with the potential to assist in novel hypothesis generation. These associations and networks have been shown to capture possible causal relationships under certain implicit assumptions and have proven to be useful abstractions of the underlying signaling mechanism. Such an understanding can provide system-level insights and often precedes the development of meaningful interventions. Several network inference algorithms have been proposed in the literature, including those that depend on pairwise and conditional dependencies. However, little attention has been given to the impact of possible discrepancies in noise variance in the data obtained across the molecular entities. In molecular settings, such discrepancies can be attributed to several factors including inherent stochastic mechanisms, heterogeneity in cell populations, variations in abundance of the molecules, variation in binding affinities, sensitivity of the measurement device and other experimental artifacts. Understanding the discrepancies in noise variance is critical in order to avoid spurious conclusions and is an important step prior to identifying the source of the noise.

The present study elucidated the non-trivial impact of discrepancies in noise variance on associations and network inference algorithms across synthetic as well as experimental data. The impact of large discrepancies in noise variance on associations and network structure inferred from data generated using linear models of popular network motifs and fundamental connections, as well as from experimental protein expression profiles, was investigated. Analytical expressions and simulations were presented for three popular molecular network motifs and fundamental connections (common effect, three-chain and coherent Type-I feed-forward loop). It was shown that discrepancies in noise variance can significantly alter the results of pairwise dependencies, conditional dependencies as well as constraint-based Bayesian network structure learning techniques that implicitly rely on tests for conditional independence. As expected, discrepancies in noise variance were found to result in markedly different topologies from those of their noise-free counterparts, challenging reliable inference of the underlying network topology. Such discrepancies were also shown to result in spurious conclusions of similar structures across markedly distinct network topologies. The impact of discrepancies in noise variance was also investigated on publicly available single-cell molecular expression profiles of a sub-network comprising three molecules (PIP2, PIP3, Plcγ) involved in human T-cell signaling. The sub-network shared considerable resemblance to the coherent Type-I feed-forward loop. The distribution of the raw expression estimates across these three molecules was positively skewed, indicating large variations in the expression estimates across the single cells. Variance about the average expression across the three molecules was found to be markedly different and proportional to their average values.
Several factors can contribute to such discrepancies, including the abundance of these molecules, antibody binding characteristics, and uncertainty due to possible overlap in the wavelengths corresponding to the colors tagged to the molecules. In the present study, a linear model was fit to the molecular expression data. Parameter estimates from the linear model indicated significant discrepancies in the noise variances across the molecules. Adjusting for these discrepancies in the model was shown to significantly affect the edge confidences of the resulting networks, and hence the topology. The results were presented on the raw molecular expression data as well as its log-transformed counterpart. As expected, log-transforming the data not only reduced the positive skew of the expression profile but also rendered the noise variance estimates comparable across the molecules. However, the networks inferred using the log-transformed data were considerably different from those inferred on the raw data. While identifying the source of the variation and controlling for it prior to network inference may be the long-term goal and a research problem in its own right, understanding the impact of discrepancies in noise variance is a critical step in this direction. While the present study focused on simple network motifs comprising three molecules, the concerns are likely to be aggravated across more complex network topologies. The analytical treatment provided in the present study has the potential to be translated across other settings such as genome-wide association studies (GWAS)
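The observation that log-transformation renders noise variances comparable can be illustrated with a short Python sketch. It assumes purely multiplicative (log-normal) noise around three hypothetical mean expression levels; this noise model, the mean levels and the spread parameter are assumptions made for illustration and are not estimates from the experimental data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 2000

# Hypothetical mean expression levels for three molecules (assumed values).
means = np.array([10.0, 100.0, 1000.0])

# Multiplicative log-normal noise: the spread scales with the mean level.
raw = means * rng.lognormal(mean=0.0, sigma=0.5, size=(n_cells, 3))

raw_var = raw.var(axis=0, ddof=1)
log_var = np.log(raw).var(axis=0, ddof=1)

# Ratio of largest to smallest variance, before and after the transform.
raw_ratio = raw_var.max() / raw_var.min()
log_ratio = log_var.max() / log_var.min()

print("variance ratio, raw data:", float(raw_ratio))
print("variance ratio, log data:", float(log_ratio))
```

Under this assumed noise model the raw variances differ by orders of magnitude, whereas the variances of the log-transformed values are nearly equal. Whether the same holds for a given data set depends on how closely its noise follows a multiplicative form.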