
Fig 1.

Gene regulatory network inference schematic.

(A) Our network inference algorithm takes as input a gene expression matrix, X, and a prior on network structure, and outputs regulatory hypotheses of regulator-target interactions. (B) Using priors on network topology and gene expression data, we estimate transcription factor activities (TFA) and subsequently model gene expression as a function of these activities. (C) We use several possible sources of prior information on network topology. (D) Prior information is encoded in a matrix P, where positive and negative entries represent known activation and repression, respectively, whereas zeros represent the absence of a known regulatory interaction. To estimate hidden activities, we consider X = PA (top), where the only unknowns are the activities A. Of note, a time delay is implemented for time-series experiments (bottom). (E) Finally, for each gene, we find regulators that influence its expression using regularized linear regression. We learn these influences, or weights, either independently for each dataset through single-task learning (top) or jointly through multi-task learning (bottom).
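
As an illustrative sketch of the activity-estimation step in (D): assuming P is the genes-by-TFs prior matrix and X the genes-by-samples expression matrix, the activities A can be obtained as the least-squares solution of X = PA. The NumPy snippet below is a minimal rendering of that idea (variable names are ours, and the time-delayed variant for time-series data is omitted); it is not the authors' implementation.

```python
import numpy as np

def estimate_tfa(X, P):
    """Estimate hidden TF activities A from X = P A by least squares.

    X : (n_genes, n_samples) expression matrix
    P : (n_genes, n_tfs) prior matrix (+1 activation, -1 repression, 0 unknown)
    Returns A : (n_tfs, n_samples) activity estimates.
    """
    # Least-squares solution A = pinv(P) @ X. Fallbacks for TFs with no
    # prior targets (e.g., using their own expression) are omitted here.
    return np.linalg.pinv(P) @ X

# Toy example: 4 genes, 2 TFs, 3 samples
P = np.array([[1, 0], [1, 0], [0, -1], [0, 1]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
A = estimate_tfa(X, P)
print(A.shape)  # (2, 3)
```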


Fig 2.

Representation of the weights matrix for one gene in the multitask setting.

We represent model coefficients as a matrix W (predictors by datasets), where nonzero rows represent predictors relevant for all datasets. We decompose the weights into two components and regularize them differently, using a sparse penalty (l1/l1, applied to the S component) to encode a dataset-specific part and a block-sparse penalty (l1/l∞, applied to the B component) to encode a conserved one. In this example, non-zero weights are shown on the right. Note that, in this schematic example, regulators w3 and w7 are shared across all datasets. The objective function minimized to estimate S and B is shown at the bottom (for details, see Methods).
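
For concreteness, an objective of the kind sketched at the bottom of the figure can be written as follows; the notation is ours and is only meant to convey the structure of the penalty, not the paper's exact formulation. For one target gene across K datasets, with activity design matrices $X^{(k)}$, responses $y^{(k)}$, and per-dataset weights given by the k-th columns of S and B:

$$\min_{B,\,S}\ \sum_{k=1}^{K} \frac{1}{2 n_k}\left\lVert y^{(k)} - X^{(k)}\left(B_{:,k} + S_{:,k}\right)\right\rVert_2^2 \;+\; \lambda_S \lVert S\rVert_{1,1} \;+\; \lambda_B \lVert B\rVert_{1,\infty}$$

where $\lVert S\rVert_{1,1}=\sum_{j,k}\lvert S_{jk}\rvert$ encourages entry-wise sparsity (dataset-specific edges) and $\lVert B\rVert_{1,\infty}=\sum_j \max_k \lvert B_{jk}\rvert$ encourages whole rows of B to be zero or nonzero together (edges conserved across datasets).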


Fig 3.

Multitask learning improves accuracy of inferred networks.

(A) Relationship between TF activity and target expression in B. subtilis 1 (blue) and in B. subtilis 2 (orange), and the corresponding STL- and MTL-inferred confidence scores, for an example interaction in the B. subtilis gold-standard, sigB to ydfP. (B) As in (A), but for an interaction in the S. cerevisiae gold-standard, Rap1 to Rpl12a. (C) Precision-recall curves assessing the accuracy of network models inferred for individual B. subtilis datasets against a leave-out set of interactions. Barplots show the mean area under the precision-recall curve (AUPR) for each method and dataset. Error bars show the standard deviation across 10 splits of the gold-standard into prior and evaluation sets. (D) Precision-recall curves assessing the accuracy of network models inferred for individual S. cerevisiae datasets, with the difference that the prior is from an independent source (no splits or replicates).
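
Panels (C) and (D) score networks by the area under the precision-recall curve (AUPR). As a minimal illustration of that metric, assuming only a vector of per-edge confidence scores and binary gold-standard labels (scikit-learn is used here as a convenient reference implementation, not as the paper's evaluation code):

```python
from sklearn.metrics import precision_recall_curve, auc

# Hypothetical per-edge confidence scores and gold-standard labels
scores = [0.90, 0.80, 0.35, 0.30, 0.10]  # inferred confidences
labels = [1, 1, 0, 1, 0]                 # 1 = edge present in the gold standard

precision, recall, _ = precision_recall_curve(labels, scores)
aupr = auc(recall, precision)
print(f"AUPR = {aupr:.3f}")
```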


Fig 4.

Multitask learning performance boost outweighs benefits of other data integration methods.

Assessment of the accuracy of network models learned using three data integration strategies: data merging with batch correction (STL-BC), an ensemble method combining models learned independently (STL-C), and an ensemble method combining models learned jointly (MTL-C). (A) Precision-recall curves for B. subtilis, again using a leave-out set of interactions. Barplots show the mean area under the precision-recall curve (AUPR) for each method. Error bars show the standard deviation across 10 splits of the gold-standard into prior and evaluation sets. (B) Precision-recall curves for S. cerevisiae, with the difference that the prior is from an independent source (no splits or replicates).
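
The caption does not spell out how the per-dataset models are combined into an ensemble, so the snippet below shows one generic possibility, rank-averaging the per-dataset edge-confidence matrices; it should be read as an illustration of the idea of combining models, not as the combination rule actually used for STL-C and MTL-C.

```python
import numpy as np
from scipy.stats import rankdata

def combine_confidences(confidence_matrices):
    """Rank-average per-dataset edge-confidence matrices into one ensemble.

    confidence_matrices : list of (n_tfs, n_genes) arrays of edge confidences.
    Edges that rank highly in every dataset get the highest combined score.
    """
    ranks = [rankdata(c, method="average").reshape(c.shape)
             for c in confidence_matrices]
    mean_rank = np.mean(ranks, axis=0)
    # Rescale to [0, 1] so the result reads like a confidence score
    return (mean_rank - mean_rank.min()) / (mean_rank.max() - mean_rank.min())

# Toy example with two datasets
rng = np.random.default_rng(1)
combined = combine_confidences([rng.random((3, 4)), rng.random((3, 4))])
```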


Fig 5.

Recovery of prior interactions depends on prior quality and is robust to increasing prior weights.

Distributions of the number of regulators per target in the B. subtilis prior (A), the S. cerevisiae gold-standard (B), and the S. cerevisiae chromatin accessibility-derived priors (C). (D) Distributions of MTL-inferred confidence scores for interactions in the prior, for each dataset. Colors indicate the prior weight used, which represents the amount by which interactions in the prior are favored by model selection relative to interactions without prior information. (E) Distributions of MTL-inferred confidence scores for true (yellow) and false (gray) interactions in the prior, for each dataset. (F) Counts of MTL-inferred interactions with non-zero confidence scores among true (yellow) and false (gray) interactions in the prior, for each dataset.
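
The prior weight in (D) quantifies how strongly model selection favors predictors listed in the prior. As a purely illustrative sketch of one way such a bias can be implemented (shrinking the per-coefficient regularization penalty for prior-supported predictors; this is our assumption, not necessarily the paper's exact scheme):

```python
import numpy as np

def penalty_factors(prior_row, prior_weight):
    """Per-predictor penalty multipliers for one target gene.

    prior_row    : (n_tfs,) array; nonzero where the prior lists a regulator.
    prior_weight : >= 1; larger values favor prior-supported predictors more.
    Predictors in the prior are penalized 1/prior_weight times as strongly.
    """
    factors = np.ones(prior_row.shape, dtype=float)
    factors[prior_row != 0] /= prior_weight
    return factors

prior_row = np.array([1, 0, -1, 0])  # TFs 0 and 2 appear in the prior
print(penalty_factors(prior_row, prior_weight=1.0))  # [1.  1.  1.  1. ] (no bias)
print(penalty_factors(prior_row, prior_weight=2.0))  # [0.5 1.  0.5 1. ]
```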


Fig 6.

Overlap of edges in inferred networks is higher for B. subtilis than for S. cerevisiae.

Edge overlap across networks inferred using multitask learning for B. subtilis (prior weight of 1.0) (A), for S. cerevisiae using the gold-standard as priors (B), and for S. cerevisiae using the chromatin accessibility-derived priors (C).
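
The panels report how many edges are shared across the inferred networks. The exact overlap statistic is not specified in this caption, so the sketch below uses the Jaccard index between edge sets as one reasonable choice; the gene names in the toy example are placeholders.

```python
def jaccard_overlap(edges_a, edges_b):
    """Jaccard index between two edge sets, each a set of (TF, target) pairs."""
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Toy example with placeholder TF-target pairs
net1 = {("sigB", "ydfP"), ("sigB", "geneX")}
net2 = {("sigB", "ydfP"), ("tfA", "geneY")}
print(jaccard_overlap(net1, net2))  # 0.333...
```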
