The authors have declared that no competing interests exist.

Pathway analysis is widely used to gain mechanistic insights from high-throughput omics data. However, most existing methods do not consider signal integration represented by pathway topology, resulting in enrichment of convergent pathways when downstream genes are modulated. Incorporation of signal flow and integration in pathway analysis could rank the pathways based on modulation in key regulatory genes. This implementation can be facilitated for large-scale data by discrete state network modeling due to simplicity in parameterization. Here, we model cellular heterogeneity using discrete state dynamics and measure pathway activities in cross-sectional data. We introduce a new algorithm, Boolean Omics Network Invariant-Time Analysis (BONITA), for signal propagation, signal integration, and pathway analysis. Our signal propagation approach models heterogeneity in transcriptomic data as arising from intercellular heterogeneity rather than intracellular stochasticity, and propagates binary signals repeatedly across networks. Logic rules defining signal integration are inferred by genetic algorithm and are refined by local search. The rules determine the impact of each node in a pathway, which is used to score the probability of the pathway’s modulation by chance. We have comprehensively tested BONITA for application to transcriptomics data from translational studies. Comparison with state-of-the-art pathway analysis methods shows that BONITA has higher sensitivity at lower levels of source node modulation and similar sensitivity at higher levels of source node modulation. Application of BONITA pathway analysis to previously validated RNA-sequencing studies identifies additional relevant pathways in

21st-century biotechnology has enabled measurements of genes and proteins at large scale by RNA sequencing and proteomics technologies. In particular, RNA-sequencing has become a first step of unbiased interrogation. These studies frequently produce a long list of differentially abundant genes, which become interpretable by widely used pathway analysis methods. The pathway topologies frequently include information on how genes interact and influence each other’s expression, but current methods do not utilize this information to estimate signal flow through each pathway. We have developed a model of binary (on/off) behavior that accounts for varying expression across samples as different proportions of cells expressing genes. We model signal flow by averaging repeated simulations of individual cells passing binary signals through molecular networks. We use this model to infer regulatory rules explaining gene expression. These rules of signal integration for all nodes in the network are used to identify the most important genes, and to determine if a pathway’s activity is different between two groups. BONITA compares favorably to previous approaches using simulated and real data. Furthermore, application to 36 datasets from 15 different diseases demonstrates BONITA’s exceptional ability to detect drug targets.
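As a rough illustration of the population-of-cells idea described above, the sketch below simulates many virtual cells, each carrying binary node states, and averages their final states to obtain a bulk-like value per node. The three-node network, its rules, the synchronous update scheme, and all function names are hypothetical choices made for illustration; they are not taken from the BONITA implementation.

```python
import random

# Hypothetical three-node network (not from the paper): A is an input,
# B is activated by A, and C requires both A and B (an "and" rule).
RULES = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] and s["B"],
}

def simulate_cell(initial_state, steps=10):
    """Synchronously update one cell's binary state for a fixed number of steps."""
    state = dict(initial_state)
    for _ in range(steps):
        state = {node: int(bool(rule(state))) for node, rule in RULES.items()}
    return state

def simulate_population(on_fractions, n_cells=1000, steps=10, seed=0):
    """Average final binary states over many simulated cells.

    on_fractions maps each node to the assumed fraction of cells in which
    that node starts "on" (a stand-in for scaled bulk expression).
    """
    rng = random.Random(seed)
    totals = {node: 0 for node in RULES}
    for _ in range(n_cells):
        start = {n: int(rng.random() < on_fractions[n]) for n in RULES}
        final = simulate_cell(start, steps)
        for node, value in final.items():
            totals[node] += value
    return {node: totals[node] / n_cells for node in totals}

bulk = simulate_population({"A": 0.3, "B": 0.9, "C": 0.1})
```

In this toy network every node ends up following the input A, so the averaged values for B and C converge to the fraction of cells in which A started on, regardless of their own starting fractions.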

This is a

Gene set and pathway analysis have become one of the first choices for gaining mechanistic insights from high-throughput sequencing and gene/protein profiling techniques [

Discrete state network modeling has been used to study high throughput gene and protein profiling data collected across multiple time-points by utilizing two different underlying models of variation [

Here, we describe BONITA (Boolean Omics Network Invariant-Time Analysis), which captures cellular heterogeneity, a critical source of variability in transcriptomic data. A portion of the variance in gene expression stems from heterogeneity in the activation state of cells, in addition to variation in expression levels within each cell. This is demonstrated by gene expression in multiple stem cell types [

BONITA is currently implemented and tested for application to transcriptomics data, but work is under way to apply it to other types of data including proteomics, metabolomics, and phosphoproteomics. BONITA is rigorously tested using simulated data and is applied to publicly available experimental datasets. In addition, a comparison of BONITA-RD to an existing algorithm for time-course data [

BONITA network propagation (NP) runs on prior knowledge networks obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) using the KEGG API. Activating/inhibiting relationships are inherited from KEGG edge attributes [

BONITA-RD implements a combination of a genetic algorithm and a node-wise local search to infer logic-rules. BONITA assumes cross-sectional samples represent steady states and minimizes change after simulation of a network as given by:
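$$\mathrm{SSE} = \sum_{i}\sum_{j}\left(d_{i,j} - s_{i,j}\right)^{2}$$
where $d_{i,j}$ is the value of node $i$ in sample $j$ in the input data and $s_{i,j}$ is the value of node $i$ in sample $j$ after simulation.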
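The steady-state objective described above, the change between the input data and the network state after simulation, can be sketched as a sum of squared differences over nodes and samples. The nested-list layout (one row per node, one column per sample), the function name, and the example values are assumptions made for illustration.

```python
def sum_squared_error(data, simulated):
    """Sum over nodes i and samples j of (data[i][j] - simulated[i][j]) ** 2."""
    return sum(
        (d - s) ** 2
        for data_row, sim_row in zip(data, simulated)
        for d, s in zip(data_row, sim_row)
    )

# Illustrative values only: two nodes (rows) measured in two samples (columns).
data = [[0.2, 0.8], [0.5, 0.5]]
simulated = [[0.2, 0.6], [0.4, 0.5]]
error = sum_squared_error(data, simulated)  # 0.2**2 + 0.1**2, about 0.05
```

A rule set that leaves the data unchanged after simulation scores zero, which is the sense in which the samples are treated as steady states.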

Rhomboids represent inputs while rectangles represent calculation steps. MSE is mean square error given by

The genetic algorithm generates new rule sets (individuals) either by selecting rules for randomly chosen nodes from their parent rule sets, or by mutating (altering) a particular rule and its incoming nodes. At later generations, crossover events tend to produce rule sets that have already been tried in earlier generations, leading to a greater probability of mutations. The space of potential rules is extremely large and scales quickly with in-degree. Hence, to reduce the space of potential rules to a region that can be sampled, a maximum of three upstream regulators is selected for each node. This is a compromise between decreasing resolution and increasing search time. For nodes with more than three upstream regulators, the three upstream regulators (U) are sampled in the genetic algorithm using a probability function based on the Spearman correlation of each upstream regulator with the node (N) for which the rule is being determined. For all simulations shown in this report, the genetic algorithm was run for 120 generations with a constant population size of 24, so that 24 new rule sets were generated and tested at each generation. Decreasing errors (

a) The sum of squares of node-wise error (SSE) is plotted across genetic algorithm generations for three representative test networks with varying complexity (node degree = 346, 108, and 29). b) Percent rules true (y-axis) identified by BONITA-RD is plotted against the number of nodes in 11 networks (x-axis) for 25 simulated trials. Error bars represent standard error. c) The percent rules true in ERS (y-axis) for each node is plotted against the in-degree of that node in the prior knowledge network. d) The log average total size of ERS for each node is plotted against the log total ancestor overlap, which is the sum of the pairwise shared ancestor number between any two upstream nodes.
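The regulator-sampling step of the genetic algorithm, in which at most three upstream regulators are drawn for a node according to their Spearman correlation with it, could look roughly like the sketch below. The exact probability function is not reproduced here; weighting by the absolute correlation is an assumption, as are the function name and the example correlations.

```python
import random

def sample_regulators(correlations, k=3, seed=None):
    """Pick up to k upstream regulators without replacement.

    correlations maps regulator name -> Spearman correlation with the target.
    ASSUMPTION: the selection weight is |correlation|; the paper's actual
    probability function may differ.
    """
    rng = random.Random(seed)
    pool = dict(correlations)
    if len(pool) <= k:
        # Nothing to sample: all regulators are kept.
        return sorted(pool)
    chosen = []
    for _ in range(k):
        names = list(pool)
        # Small offset avoids an all-zero weight vector.
        weights = [abs(pool[n]) + 1e-9 for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen

# Hypothetical correlations for a node with five upstream regulators.
picked = sample_regulators(
    {"g1": 0.9, "g2": -0.7, "g3": 0.1, "g4": 0.05, "g5": 0.6}, seed=1
)
```

Capping the in-degree at three keeps the number of candidate rules per node small enough to enumerate, which is the trade-off between resolution and search time noted in the text.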

This exhaustive search only evaluates the possible rules at each node while holding constant all other rules as well as the incoming edges to that node as determined by the genetic algorithm. The node-level local search was initiated with the minimal error rule set from the genetic algorithm and was found to be effective in inferring the rules as shown in the results (

To test BONITA-RD, simulated data representing 5 samples was generated by BONITA-NP with a rule set and initial states determined by a uniform random distribution. Rules determined by BONITA-RD were then compared with the rule-set used to generate the data.

BONITA-PA seeks to prioritize nodes that have a large influence over signal flow through the network by assigning node-level impact scores. The impact score of a node g, $I_{g}$, captures the change induced in the network when the node is perturbed. $I_{g}$ is given by the difference in network state after knockout and knock-in of g:
_{i,j} and _{i,j} are BONITA-NP outputs when _{p} is calculated as follows:
_{g} is the fold difference of _{p} values is generated by weighting impact scores for a specific pathway’s topology with random fold differences that are re-sampled from the gene expression data. Pathways with at least four genes in the transcriptomic data are considered.
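The knockout/knock-in comparison behind the impact score can be sketched on a toy network as follows. This is only an illustration under stated assumptions: the four-node network, the synchronous updates, the use of an absolute difference summed over all other nodes and samples, and the function names are hypothetical and do not reproduce the paper's exact formula.

```python
def run_network(rules, start, fixed, steps=20):
    """Synchronously update binary states while clamping the nodes in `fixed`."""
    state = {**start, **fixed}
    for _ in range(steps):
        state = {node: int(bool(rule(state))) for node, rule in rules.items()}
        state.update(fixed)  # re-apply the knockout or knock-in
    return state

def impact_score(rules, samples, gene):
    """Sum over samples and other nodes of |knock-in state - knockout state|.

    ASSUMPTION: absolute differences over all nodes except the perturbed one.
    """
    total = 0
    for start in samples:
        knocked_out = run_network(rules, start, {gene: 0})
        knocked_in = run_network(rules, start, {gene: 1})
        total += sum(
            abs(knocked_in[n] - knocked_out[n]) for n in rules if n != gene
        )
    return total

# Hypothetical chain A -> B -> C plus an unconnected node D.
rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
    "D": lambda s: s["D"],
}
samples = [
    {"A": 0, "B": 0, "C": 0, "D": 1},
    {"A": 1, "B": 1, "C": 1, "D": 0},
]
scores = {gene: impact_score(rules, samples, gene) for gene in rules}
```

In this chain the most upstream node receives the largest score, because perturbing it changes every node downstream of it.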

To compare BONITA-PA with existing pathway analysis approaches, simulated datasets that resembled biological data were constructed. The data was generated using a negative binomial distribution with gene-wise means and dispersions from existing RNA-seq data [

BONITA was rigorously assessed using RNA-seq data. First, BONITA was compared with state-of-the-art pathway analysis approaches using data from the public domain. Second, BONITA’s specificity in detecting disease-specific pathways from patient data was investigated. Finally, BONITA’s ability to infer rules from a

Comparison of BONITA pathway analysis with CLIPPER and CAMERA was performed using previously published RNA-sequencing data measuring IFN-

To test whether BONITA identifies disease specific pathways, microarray gene expression data from a set of 36 experiments comparing patients to healthy controls in 15 unique diseases was analyzed [

Finally, a

BONITA is written entirely in Python and C using genetic algorithms from deap [

BONITA (Boolean Omics Network Invariant-Time Analysis) is designed to leverage variance driven by cellular heterogeneity and signal integration for advanced pathway analysis of cross-sectional data, which are frequently available in translational studies. The accuracy and robustness of BONITA-RD for cross-sectional data were assessed using a series of simulation studies, and its application to a

BONITA Network Propagation (BONITA-NP) propagates binary signals across molecular networks with the assumption that bulk transcriptomic measurements are proportional to the number of cells expressing specific genes. The signal propagation depends on the inference of logic rules performed by BONITA rule determination (BONITA-RD), which is optimized to preserve steady states assumed to be represented by the cross-sectional data. The logic rules define integration of signals coming from different genes. To test the performance of BONITA-RD, a subset of networks was obtained by searching the KEGG database for Interferon Gamma (IFN-

ERS facilitated evaluation of accuracy of BONITA-RD within the limits of cross-sectional data. BONITA-RD accuracy reached 87-99% when considering ERS among test networks (

The high accuracy across diverse networks when considering ERS demonstrates that BONITA rule inference can correctly infer rules to the extent they are distinguishable by cross-sectional data. Next, we investigated the impact of network complexity on BONITA-RD. To assess the impact of network size, BONITA-RD accuracy was compared with the number of nodes in each test network. Though the test networks have a wide range of sizes, node numbers did not explain differences in accuracy across networks (

Having established the ability of BONITA-RD to recover rules from large-scale data, we wanted to establish BONITA’s robustness to other important factors in transcriptomic data: sample number and technical noise. Susceptibility of BONITA to technical noise was investigated by adding random noise in the range 1-200% for each node in the network. The BONITA-RD accuracy remains >80% with up to 10% noise in the data (

ERS (y-axis) for (a) 1% to 200% noise (x-axis) and (b) for 2 to 15 number of samples are reported across the test networks. Error bars represent standard error.

Typically, pathway topologies available in databases are generalized cases that can lead to false positive edges not relevant to the context of a specific study. Hence, the robustness of BONITA to false positive edges in the prior knowledge network was assessed and compared to the existing algorithm that utilized discrete state modeling [

(a) Boxplots show minimum structural distance among the equivalent rule sets learned by BONITA-RD (y-axis) or (b) structural distance among rules learned by Liu’s method across prior knowledge network [

Pathway analysis is the most useful functionality of BONITA. Briefly, nodes of the network representing a pathway are perturbed $\log_{2}$ attenuation of 0.5. All the methods performed well in detecting the number of pathways for induced attenuation >1 in source nodes ($\log_{2}$ attenuation of 2.0. The same results hold when attenuation is not propagated through the downstream nodes of the pathways (

(a) The number of pathways found to be significant upon attenuation of the source node by $\log_{2}$ 0.0, 0.5, 1, 1.5, or 2 and (b) Receiver operating characteristic (ROC) curves for $\log_{2}$ induced attenuation of 0.5 and 2.0 by BONITA-PA (green), CLIPPER (orange) and CAMERA (blue). The total number of pathways tested was 60 for each attenuation using 10 simulated RNA-seq datasets and 6 test networks. ROC curves were constructed by treating $-\log_{10}$ p-values from 0.0 attenuation as one class and $-\log_{10}$ p-values from 0.5 or 2.0 as the other class.
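The two-class ROC construction described in this caption can be sketched as below, with scores from unattenuated data as the negative class and scores from attenuated data as the positive class. The score values are invented for illustration and the function names are hypothetical; the published curves may have been produced with a different tool.

```python
def roc_points(null_scores, attenuated_scores):
    """Return (false positive rate, true positive rate) pairs over all thresholds."""
    thresholds = sorted(set(null_scores) | set(attenuated_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(s >= t for s in null_scores) / len(null_scores)
        tpr = sum(s >= t for s in attenuated_scores) / len(attenuated_scores)
        points.append((fpr, tpr))
    return points

def auc(null_scores, attenuated_scores):
    """Area under the ROC curve: the probability that an attenuated score
    exceeds a null score, with ties counted as one half."""
    wins = 0.0
    for pos in attenuated_scores:
        for neg in null_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(null_scores) * len(attenuated_scores))

# Made-up -log10 p-values: one class from 0.0 attenuation, one from attenuated data.
null_scores = [0.1, 0.3, 0.5, 1.0]
attenuated_scores = [0.4, 1.2, 2.0, 2.5]
points = roc_points(null_scores, attenuated_scores)
area = auc(null_scores, attenuated_scores)  # 14 of 16 pairs ordered correctly
```

An area of 0.5 would indicate that a method cannot separate attenuated from unattenuated pathways, while 1.0 would indicate perfect separation.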

BONITA’s excellent performance on simulated data and in modeling pathway modulation calls for verifying its performance in a similar experimental setting. RNA-seq data from our previous study investigating Interferon-regulated genes (IRG) following stimulation of human choriocarcinoma (Jar) cells with IFN-

Following stimulation of human choriocarcinoma (Jar) cells with IFN-

Detecting specific pathway signals is a major challenge in genome-wide sequencing studies of human samples due to variation across individuals. Previously, we have measured changes in isolated CD4+ T cells from infants with mild and severe respiratory syncytial virus (RSV) infection by genome-wide mRNA sequencing [

| Pathway | BON | CLP | CAM |
|---|---|---|---|
| Apoptosis | 1.33 | 0.64 | 0.42 |
| Cell adhesion molecules (CAMs) | 0.10 | 0.64 | 1.46 |
| Complement and coagulation cascades | 2.56 | 0.39 | 0.01 |
| Glycolysis / Gluconeogenesis | 0.05 | 0.51 | 1.44 |

$-\log_{10}$ p-values are shown for analysis of infants with mild vs severe disease at convalescent visit with BONITA (BON), CLIPPER (CLP), and CAMERA (CAM). Significant pathways are highlighted.
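Because the table reports negative base-10 logarithms of p-values, converting them back makes the comparison easier to read. The short sketch below does this for the values in the table; the 0.05 significance threshold is an assumption, since the cutoff used for highlighting is not stated alongside the table.

```python
# Values copied from the table above (-log10 p-values per method).
neg_log10_p = {
    "Apoptosis": {"BON": 1.33, "CLP": 0.64, "CAM": 0.42},
    "Cell adhesion molecules (CAMs)": {"BON": 0.10, "CLP": 0.64, "CAM": 1.46},
    "Complement and coagulation cascades": {"BON": 2.56, "CLP": 0.39, "CAM": 0.01},
    "Glycolysis / Gluconeogenesis": {"BON": 0.05, "CLP": 0.51, "CAM": 1.44},
}

def significant_calls(table, alpha=0.05):
    """List (pathway, method) pairs whose back-transformed p-value is below alpha.

    ASSUMPTION: alpha = 0.05; the threshold is not given with the table.
    """
    return sorted(
        (pathway, method)
        for pathway, row in table.items()
        for method, value in row.items()
        if 10 ** (-value) < alpha
    )

calls = significant_calls(neg_log10_p)
```

Under this assumed threshold, BONITA and CAMERA each flag two pathways with no overlap between them, and CLIPPER flags none.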

Further, BONITA-PA produces helpful network synthesis, including rules, which can be visualized easily in a network viewer such as Cytoscape, as in

Small circular nodes indicate ‘and’ rules, whereas multiple incoming edges to a rectangular node indicate an ‘or’ rule. Colors of the rectangles ranging from white to red indicate low to high impact score. Widths of the rectangles’ outlines and their color ranging from blue to green indicate fold difference (mRNAs) between infants with severe vs mild disease. Blue represents higher expression in mild disease and green represents higher expression in severe disease. Impact scores have been divided by the largest impact score in the pathway.

To further test BONITA’s specificity, data from Ihnatova et al. was used [

One of the applications of BONITA is to define co-operativity in networks inferred from the data. Mutual information-based inductive causation (

BONITA is, to our knowledge, the first ever attempt to use discrete-state modeling for pathway analysis and builds upon decades of work to calculate node impacts in Boolean networks, Probabilistic Boolean Networks and fuzzy logic networks [

BONITA-RD is a novel approach to rule determination for cross-sectional data that offers significant advantages over previous algorithms. Existing software can solve the key problem of Boolean rule determination for large-scale omics datasets by use of genetic [

Approaches to solving networks for cross-sectional data must apply more general optimization solutions because there are no explicit transitions available. Though efficient, genetic algorithms often do not find the best configuration when combinatorial possibilities are high, i.e., when network topology is complex. BONITA-RD combines an exhaustive node-wise local search with a genetic algorithm and achieves high accuracy in determining rules from simulated data. While local search improves accuracy, it is dependent on an initial global search to resolve the complexity of the networks. BONITA-RD is robust to inaccuracies in prior knowledge networks, noise, and number of samples. This optimization happens relatively rapidly within the genetic algorithm (

In addition to making rule determination possible from cross-sectional data, the BONITA-NP algorithm accounts for cellular heterogeneity by explicitly modeling a population of cells with a distribution of on/off starting states, rather than varying levels of expression in each cell as modeled by fuzzy models. Not all genes vary in a switch-like manner by cell, but those that vary in a fuzzy manner will be implicitly modeled (with similar accuracy) in a pseudo-switch-like manner, because the internal direction of gene activation will remain the same. As expected, this model outperforms purely Boolean approaches in terms of error across pathways (

Rigorous testing of rule inference is a difficult problem. The DREAM challenge provides rigorously validated time-series data sets for evaluation of novel algorithms; however, no such test sets exist for rule inference from cross-sectional data. Hence, a well-controlled study from our collaborator Dr. Shawn Murphy was used to validate BONITA [

Interestingly, mutually exclusive pathways were identified by CAMERA and BONITA at the convalescent visit after RSV infection, but no pathways were identified by CLIPPER. The pathways were most likely mutually exclusive because BONITA has higher sensitivity to detect pathways with upstream changes that are linked to downstream variation, whereas CAMERA will detect changes in downstream genes, as observed in the glycolysis pathway, even in the absence of corresponding changes in upstream regulators of a pathway. Non-signaling networks like glycolysis may have unclear signal flows, as described in the Methods, or may contain many loops. In these cases, BONITA’s performance will be similar to that of other gene-set analysis methods instead of the enhanced performance observed on signaling networks when causation of downstream events can be linked to the upstream changes.

BONITA-PA explicitly provides increased impact to upstream nodes in the context of downstream nodes. This quantitative prioritization of upstream signaling and relative modulation highlights nodes and interactions that make pathways most interesting for further exploration. The utility of such an approach is underscored by the effectiveness of BONITA impact scores in identifying drug targets. Thus, BONITA provides a unique perspective and new capabilities to maximize the utility of transcriptomics experiments in guiding future studies. Further, BONITA can be applied to

BONITA-RD and impact score calculation were applied to a network generated by mutual information-based inductive causation (miic) [

In conclusion, BONITA introduces a new, useful, and conceptually elegant approach to considering variance in transcriptomic data. BONITA is theoretically applicable to any directed network, including

Results of BONITA simulation across human disease data sets (p-values) along with the number of drug targets and, if applicable, p value of t test of impact scores between drug targets and non-drug target nodes.

(XLSX)

BONITA-RD was optimized using the RSV infection data [

(EPS)

Values in the labeled cells represent the Pearson correlation coefficient. Colors also represent Pearson correlation coefficient, ranging from -1 (dark blue) to 1 (dark red).

(EPS)

(a) The number of pathways out of ten found to be significant in simulated RNA-seq data with source nodes of 10 random data sets each of 6 test networks attenuated by $\log_{2}$ 0.0, 0.5, 1, 1.5, or 2 without propagation to downstream nodes. (b) Receiver operating characteristic (ROC) curves for $\log_{2}$ induced attenuation of 0.5 and 2.0 without propagation to downstream nodes. ROC curves were constructed by treating $-\log_{10}$ p-values from 0.0 attenuation as one class and $-\log_{10}$ p-values from 0.5 or 2.0 as the other class. Green represents BONITA-PA, orange represents CLIPPER and blue represents CAMERA in both a and b.

(EPS)

Stephen Constable wrote BONITA’s KEGG parser, and Arica VanderWal assisted with