One model fits all: Combining inference and simulation of gene regulatory networks

doi:10.1371/journal.pcbi.1010962

Fig 1.

Single-cell data simulation using HARISSA.

(A) Networks used for subsequent tests, including feedback loop networks (FN), a cycling network (CN) and a branching network (BN). Genes and stimulus are represented by numbered nodes and an empty node, respectively, while green arrows indicate activation and red blunt arrows indicate inhibition. (B) Corresponding trajectories, defined as time-dependent mRNA levels (in copies per cell). For each network, the first plot shows one example of a single-cell trajectory M while the second plot shows the population average 〈M〉 from 1000 cells. The transcriptional bursting model underlying HARISSA implies that every single-cell trajectory differs strongly from the more usual population average. (C) Two-dimensional UMAP representations of the corresponding single-cell snapshots, defined as mRNA levels sampled at 10 timepoints in different cells from 0h to 96h, with 100 cells per timepoint. Such snapshots are called time-stamped data in the text and are fundamentally different from single-cell trajectories, which are currently not available experimentally.
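The difference between single-cell trajectories and time-stamped snapshots can be illustrated with a minimal sketch. It assumes that simulated trajectories are already available as nested lists; the function name `sample_snapshots` and the data layout are hypothetical and are not part of HARISSA. Each cell contributes to exactly one timepoint, so no cell is ever observed twice.

```python
import random

def sample_snapshots(trajectories, cells_per_timepoint, seed=0):
    """Build time-stamped data from simulated single-cell trajectories.

    trajectories[c][k] is the state of cell c at timepoint index k
    (hypothetical layout). Returns a list of (k, state) pairs in which
    every cell is used for a single timepoint only.
    """
    rng = random.Random(seed)
    n_cells = len(trajectories)
    n_times = len(trajectories[0])
    if n_cells < n_times * cells_per_timepoint:
        raise ValueError("not enough cells for disjoint snapshots")
    # Shuffle cell indices, then give each timepoint its own disjoint slice.
    cells = list(range(n_cells))
    rng.shuffle(cells)
    snapshots = []
    for k in range(n_times):
        start = k * cells_per_timepoint
        chosen = cells[start:start + cells_per_timepoint]
        snapshots.extend((k, trajectories[c][k]) for c in chosen)
    return snapshots
```

With 10 timepoints and 100 cells per timepoint, this layout would yield 1000 time-stamped cells, none of which carries information about its own past or future state.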

Fig 2.

Benchmark of inference methods for five different network structures.

For each network, inference is performed on ten independently simulated datasets, each dataset containing the same 10 timepoints with 100 cells per timepoint (1000 cells sampled per dataset). The performance on each dataset is then measured as the area under the precision-recall curve (AUPR), based on the unsigned inferred weights of edges. Finally, the performance of each method is summarized as a box plot of the corresponding AUPR values, or the average AUPR value for the tree-structure activation networks (Trees). For each plot, the dashed gray line indicates the average performance of the random estimator (assigning to each edge a weight 0 or 1 with 0.5 probability). For the Trees networks, each dataset corresponds to a random tree structure of fixed size (5, 10, 20, 50, and 100 genes) sampled from the uniform distribution over trees of this size. (A) Performance of all methods when considering only undirected interactions. (B) Performance of the methods able to infer directed interactions. (C) Performance of the SCRIBE inference method for the same networks, in three conditions: when one has access to real single-cell trajectories (in brown), when pseudo-trajectories are reconstructed from time-stamped data using a coupling method similar to Waddington-OT (in pink), and when a single pseudo-trajectory is reconstructed using the pseudotime algorithm SLINGSHOT (in light green). For the last two conditions, the datasets used are therefore the same as those used for the other methods.
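As an illustration of the scoring step, here is a minimal sketch of an AUPR computation from unsigned edge weights, together with the random estimator used as a baseline. It uses the step-wise (average precision) form of the area, which is an assumption: the caption does not say how the area is integrated. The function names are hypothetical, and ties are broken by input order, which is a simplification.

```python
import random

def average_precision(weights, labels):
    """Step-wise area under the precision-recall curve.

    weights: inferred edge weights (the sign is ignored).
    labels:  1 if the edge is in the true network, else 0.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one true edge")
    # Rank edges from strongest to weakest unsigned weight.
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos

def random_edge_weights(n_edges, seed=0):
    """Random estimator: each edge gets weight 0 or 1 with probability 0.5."""
    rng = random.Random(seed)
    return [float(rng.random() < 0.5) for _ in range(n_edges)]
```

Averaging `average_precision(random_edge_weights(n, seed), labels)` over many seeds gives an empirical version of the dashed baseline, up to the tie-breaking convention noted above.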

Fig 3.

Dependence of inference methods on data collection parameters.

For simplicity, only the case of undirected interactions is considered here and the datasets are restricted to 10-gene tree-structure networks (see Fig 2 for the general benchmark). Inference is performed for each method and condition on ten independently simulated datasets and summarized by box plots of AUPR values as in Fig 2. For each plot, the dashed gray line indicates the average performance of the random estimator (assigning to each edge a weight 0 or 1 with 0.5 probability). (A) Performance as a function of the number of cells per timepoint, while keeping the same timepoints. (B) Performance as a function of the length of the measurement period, while keeping the same gap between timepoints and the same total number of cells. (C) Performance as a function of the gap between timepoints, while keeping the same final timepoint and the same total number of cells.

Fig 4.

Comparison between inference methods and physical interactions derived from ChIP-seq.

The four directed GRN inference methods were applied to the experimental dataset from [22] restricted to a panel of 41 marker genes identified by the authors, and a reference network was obtained independently from edges supported by ChIP-seq data. As we only have access to physical interactions involving the retinoic acid (RA) stimulus or genes Pou5f1, Sox2, and Jarid2, the comparison only considers the 4 × 41 related edges. (A) Receiver operating characteristic (ROC) curve and corresponding area under the curve (AUROC) for each inference method. (B) Precision-recall (PR) curve and corresponding area under the curve (AUPR) for each inference method. (C) Venn diagram showing the overlap, for interactions involving the RA stimulus, between directed edges predicted by CARDAMOM and known physical interactions identified by ChIP-seq analysis.
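The AUROC in panel (A) has a simple rank interpretation: it is the probability that a randomly chosen ChIP-seq-supported edge receives a higher score than a randomly chosen unsupported edge. The sketch below computes it directly from that definition. The function name and the input format (one confidence score and one 0/1 label per candidate edge) are assumptions made for illustration.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (supported, unsupported) edge pairs
    that are ranked correctly; ties count for one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need both supported and unsupported edges")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of edges, which is not a concern for the 4 × 41 edges considered here.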

Fig 5.

Network inferred by CARDAMOM from a real time-stamped scRNA-seq dataset.

The CARDAMOM inference method was applied to the experimental dataset from [22] restricted to a panel of 41 marker genes identified by the authors. The network structure is obtained by keeping only the 5% strongest activations (green arrows) and inhibitions (red blunt arrows) acting on each gene. Genes are colored according to four groups related to different cell states (pluripotency, post-implantation epiblast, neuroectoderm, extraembryonic endoderm) following the classification proposed in [22]. Edges supported by a ChIP-seq interaction are marked with black dots (see main text for what counts as an interaction) and edges that are not supported are marked with white dots. These marks concern only the edges starting from the RA stimulus, Pou5f1, Sox2, and Jarid2; edges for which we have no reliable information have no mark.
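One possible reading of the thresholding step is sketched below. The assumptions are stated loudly because the caption does not fix them: the 5% is taken separately among the incoming activations and the incoming inhibitions of each target gene, and the count is rounded up, so at least one edge of each sign is kept whenever one exists. The function name `keep_strongest` and the matrix layout are hypothetical.

```python
import math

def keep_strongest(weights, fraction=0.05):
    """Keep the strongest incoming edges of each sign for every target.

    weights[i][j] is the signed weight of edge i -> j (hypothetical layout).
    Returns a list of (source, target, weight) triples.
    """
    n_sources = len(weights)
    n_targets = len(weights[0])
    kept = []
    for j in range(n_targets):
        for sign in (1, -1):  # activations first, then inhibitions
            incoming = [(i, weights[i][j]) for i in range(n_sources)
                        if sign * weights[i][j] > 0]
            if not incoming:
                continue
            # Assumption: round up the number of edges to keep.
            k = math.ceil(fraction * len(incoming))
            incoming.sort(key=lambda edge: -abs(edge[1]))
            kept.extend((i, j, w) for i, w in incoming[:k])
    return kept
```

Other conventions, such as taking 5% of all possible regulators rather than of the nonzero incoming edges, would give a slightly different network.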

Fig 6.

Time decomposition of the network inferred by CARDAMOM from a real time-stamped scRNA-seq dataset.

Decomposition of the network shown in Fig 5, where each edge appears at the timepoint for which it was detected with the strongest intensity. This dynamic representation highlights a consistent flow of information coming from the stimulus. Gene positions and colors as well as activation and inhibition representations are the same as in Fig 5.
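The decomposition rule can be written in a few lines. This is a minimal sketch assuming that a per-timepoint intensity is available for every edge of the network; the function name, the dictionary layout, and the use of the absolute value as "intensity" are assumptions made for illustration.

```python
def assign_edges_to_timepoints(intensities, timepoints):
    """Place each edge at the timepoint where its intensity is strongest.

    intensities: dict mapping an edge (source, target) to a list holding
                 one signed intensity per timepoint (hypothetical layout).
    timepoints:  list of timepoint labels, in the same order.
    """
    by_time = {t: [] for t in timepoints}
    for edge, values in intensities.items():
        if len(values) != len(timepoints):
            raise ValueError("one intensity per timepoint is required")
        # Index of the largest absolute intensity; the first one wins on ties.
        best = max(range(len(values)), key=lambda k: abs(values[k]))
        by_time[timepoints[best]].append(edge)
    return by_time
```

Every edge of the network appears exactly once in the output, so the per-timepoint graphs partition the full network.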

Fig 7.

Inferred network simulations compared to the original dataset.

(A) Heatmap of p-values associated with Kolmogorov–Smirnov (KS) tests between real and simulated mRNA distributions, for each of the 41 genes of the network and for each timepoint. The green color indicates p-values greater than 5%, implying that the model output is not significantly different from the experimental dataset. (B) Time-dependent distributions of Esrrb and Sparc genes for the experimental dataset (original data, in beige) and datasets simulated after calibrating the mechanistic model, one including interactions (inferred network, in blue) and one obtained after removing interactions (without interactions, in orange). (C-D) Average earth mover's (EM) distance (C) and average KS p-value (D) between real and simulated distributions, for the inferred network and without interactions. The dispersion corresponds to the first and ninth deciles from ten simulations. (E-F-G) Two-dimensional UMAP representations of the original dataset (E) and the datasets simulated from the inferred network (F) and without interactions (G). In these three plots, Sparc is removed from the genes represented in Fig 5 as its dynamics are not well captured by the mechanistic model, so the three datasets consist of 2449 cells with mRNA levels of 40 genes.
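Both comparison metrics can be computed from the empirical cumulative distribution functions of the real and simulated samples for one gene at one timepoint. The sketch below is a dependency-free illustration: it returns the two-sample KS statistic (not the p-value) and the one-dimensional EM distance. The function names are hypothetical. In practice, `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` provide the same quantities, including the KS p-value.

```python
from bisect import bisect_right

def _ecdf_pair(xs, ys, value):
    """Empirical CDFs of two sorted samples evaluated at `value`."""
    return bisect_right(xs, value) / len(xs), bisect_right(ys, value) / len(ys)

def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in sorted(set(xs) | set(ys)):
        fx, fy = _ecdf_pair(xs, ys, v)
        gap = max(gap, abs(fx - fy))
    return gap

def em_distance(x, y):
    """1D earth mover's distance: integral of |F_x - F_y| over the real line."""
    xs, ys = sorted(x), sorted(y)
    support = sorted(set(xs) | set(ys))
    total = 0.0
    # Both CDFs are constant between consecutive observed values.
    for left, right in zip(support, support[1:]):
        fx, fy = _ecdf_pair(xs, ys, left)
        total += abs(fx - fy) * (right - left)
    return total
```

For example, shifting a sample by one unit, as in `em_distance([0, 1, 2], [1, 2, 3])`, gives a distance of 1 up to floating-point rounding, which matches the intuition of moving every cell by one mRNA copy.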
