The authors have declared that no competing interests exist.

Across diverse biological systems—ranging from neural networks to intracellular signaling and genetic regulatory networks—the information about changes in the environment is frequently encoded in the full temporal dynamics of the network nodes. A pressing data-analysis challenge has thus been to efficiently estimate the amount of information that these dynamics convey from experimental data. Here we develop and evaluate decoding-based estimation methods to lower bound the mutual information about a finite set of inputs, encoded in single-cell high-dimensional time series data. For biological reaction networks governed by the chemical Master equation, we derive model-based information approximations and analytical upper bounds, against which we benchmark our proposed model-free decoding estimators. In contrast to the frequently-used k-nearest-neighbor estimator, decoding-based estimators robustly extract a large fraction of the available information from high-dimensional trajectories with a realistic number of data samples. We apply these estimators to previously published data on Erk and Ca^{2+} signaling in mammalian cells and to the yeast stress response, and find that a substantial amount of information about environmental state can be encoded by non-trivial response statistics even in stationary signals. We argue that these single-cell, decoding-based information estimates, rather than the commonly-used tests for significant differences between selected population response statistics, provide a proper and unbiased measure for the performance of biological signaling networks.

Cells represent changes in their own state or in the state of their environment by temporally varying the concentrations of intracellular signaling molecules, mimicking in a simple chemical context the way we humans represent our thoughts and observations through temporally varying patterns of sounds that constitute speech. These time-varying concentrations are used as signals to regulate downstream molecular processes, to mount appropriate cellular responses to environmental challenges, or to communicate with nearby cells. But how precise and unambiguous is such chemical communication, in theory and in data? On the one hand, intuition tells us that many possible environmental changes could be represented by variation in concentration patterns of multiple signaling chemicals; on the other, we know that chemical signals are inherently noisy at the molecular scale. Here we develop data analysis methodology that allows us to pose and answer these questions rigorously. Our decoding-based information estimators, which we test on simulated and real data from yeast and mammalian cells, measure how precisely individual cells can detect and report environmental changes, without making assumptions about the structure of the chemical communication and using only the amounts of data that are typically available in today’s experiments.

For their survival, reproduction, and differentiation, cells depend on their ability to respond and adapt to continually changing environmental conditions. Environmental information must be sensed and often transduced to the nucleus, where an appropriate response is initiated, usually by selectively up- or down-regulating the expression levels of target genes. This information flow is mediated by biochemical reaction networks, in which concentrations of various signaling molecules code for different environmental states or different response programs. This map between environmental input or response output and the internal chemical state is, however, highly stochastic, because typical networks operate with small absolute copy numbers of signaling molecules [

Information theory provides a framework within which the theoretical study of limits to communication as well as the empirical study of actual information flows can be addressed [

Recent theoretical work analyzed the reliability of information transmission through specific reaction systems in the presence of molecular noise, e.g., during ligand binding [

Empirical estimates of information transmission in biochemical networks similarly focused on the steady state [^{4}) numbers of sampled response trajectories, thereby permitting direct information estimates using generic estimators like the k-nearest-neighbors (knn) [

At their core, cellular processes consist of networks of chemical reactions. A chemical reaction network consists of a set of

If we assume that the system is well-stirred, in thermal equilibrium and the reaction volume is constant, it can be shown that the probability that a reaction of type _{k} is a constant that depends on the physical characteristics of the cell but also on the environmental conditions.

Let us denote the probability that

The CME given in
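Trajectories of the processes described by the CME can be sampled exactly with a stochastic simulation (Gillespie) algorithm. As a minimal sketch in Python (the function and rate names `k_prod` and `lam_deg` are ours, introduced for illustration and not taken from the text), here is an exact simulation of the simple birth-death process used in the examples below:

```python
import numpy as np

def gillespie_birth_death(k_prod, lam_deg, t_max, x0=0, rng=None):
    """Exact (Gillespie) simulation of the birth-death process
    0 --k_prod--> X,  X --lam_deg--> 0.
    Returns event times and copy numbers (a piecewise-constant trajectory)."""
    rng = np.random.default_rng() if rng is None else rng
    t, x = 0.0, x0
    times, counts = [t], [x]
    while t < t_max:
        a_birth = k_prod           # propensity of production
        a_death = lam_deg * x      # propensity of degradation
        a_total = a_birth + a_death
        t += rng.exponential(1.0 / a_total)  # waiting time to next reaction
        if t >= t_max:
            break
        # choose which reaction fires, proportionally to its propensity
        if rng.random() < a_birth / a_total:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return np.array(times), np.array(counts)
```

At long times the copy number fluctuates around k_prod/lam_deg, consistent with the Poisson steady state of the birth-death process.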

To study information transmission through the biochemical networks described by the CME, we need to define the input and output signals. In the simplest setup considered here, the input ^{(1)}, ^{(2)}, …, ^{(q)}}. Each input in general corresponds to a distinct set of reaction rate constants

Information theory introduces the mutual information as the measure of fidelity by which changes in one random variable, e.g., the input _{U} and _{X} are the marginal density functions for

Mutual information is a non-negative symmetric quantity that is measured in bits, and is zero only if _{U}(

_{U} that the network receives. In this work, we will consider discrete inputs and will assume uniform _{U}. It is, however, also possible to compute the
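For the discrete inputs considered in this work, the mutual information reduces to a finite sum over the joint distribution of input and (discretized) response. A minimal numpy sketch of this definition:

```python
import numpy as np

def mutual_information_bits(p_joint):
    """Mutual information I(U;X) in bits from a joint probability table
    p_joint[i, j] = P(U = u_i, X = x_j)."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_u = p_joint.sum(axis=1, keepdims=True)   # marginal over X
    p_x = p_joint.sum(axis=0, keepdims=True)   # marginal over U
    mask = p_joint > 0                         # 0 log 0 = 0 by convention
    return float((p_joint[mask] *
                  np.log2(p_joint[mask] / (p_u @ p_x)[mask])).sum())
```

A perfectly correlated pair of binary variables yields 1 bit; an independent pair yields 0 bits, reflecting the properties listed above.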

The setup we consider here is one in which inputs

For fully-observed reaction networks whose dynamics are governed by a known chemical Master equation, information can be approximated to an arbitrary accuracy via Monte Carlo integration for either continuous-time or discrete-time response trajectories (model-based

Given the specification of the biochemical reaction network in _{1}, _{2}, …, _{r}], where _{1}, …, _{r}], 0 < _{i} < _{1}) is given by the initial conditions of the process, and the transition matrix _{X}(

We can resample the continuous trajectories ^{0}, …, ^{d}], and its realizations, the discrete trajectories, as

In the discrete case, the likelihood of _{exact}, as in the continuous case: we get the marginal _{X}(

In the absence of a full stochastic model for the biochemical reaction network, mutual information estimation is tractable only if we make assumptions about the distribution of response trajectories given the input. We briefly summarize two approaches below: in the first, k-nearest-neighbor procedure, the space in which the response trajectories are embedded is assumed to have a particular metric; in the second, Gaussian approximation, we assume a particularly tractable functional form for the channel,

The idea of using nearest-neighbor statistics to estimate entropies is at least 70 years old [

A simplifying assumption in the Gaussian approximation is that the distribution of trajectories sampled at discrete times given input is approximately Gaussian, with the mean _{U}(_{G}(_{G}(_{G}(_{G}(_{G}(

To apply this estimator, one must use real (or simulated) data to estimate the conditional mean,

Here and in the next section we introduce a class of decoding-based calculations that lower-bound the exact information, _{i} and _{i}, for ^{(1)}, …, ^{(q)}} and 10^{2} − 10^{3} trajectories) in the case of model-free information estimates, or trajectories generated by exact simulation algorithms (in which case the sample size,

The procedure of estimating the input _{i} in the dataset a corresponding “decode” _{d} represents time discretization, form a Markov chain. In other words, the distribution of

To compute the information lower bound, we apply the decoding function to each trajectory in _{ij} counts the fraction of realizations of ^{(i)} that decode into
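The lower bound described above is the mutual information between the true input and its decode, computed from the empirical confusion matrix. A minimal sketch, assuming a uniform prior over the inputs unless one is supplied:

```python
import numpy as np

def decoding_information_bits(confusion, p_u=None):
    """Decoding lower bound I(U; U_hat) in bits from a confusion matrix whose
    rows are true inputs: confusion[i, j] is the fraction of trajectories
    from input u_i decoded as u_j (each row sums to 1)."""
    C = np.asarray(confusion, dtype=float)
    q = C.shape[0]
    p_u = np.full(q, 1.0 / q) if p_u is None else np.asarray(p_u, dtype=float)
    joint = p_u[:, None] * C                   # P(U = u_i, U_hat = u_j)
    p_hat = joint.sum(axis=0, keepdims=True)   # marginal of the decode
    mask = joint > 0
    return float((joint[mask] *
                  np.log2(joint[mask] / (p_u[:, None] * p_hat)[mask])).sum())
```

A perfect decoder (identity confusion matrix) over q equally likely inputs recovers the maximal log2 q bits; a decoder at chance level yields 0 bits.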

In the MAP lower bound, the decoding function _{ω} is given by Bayesian inference of the most likely input

The MAP inference consists of finding the input that maximizes the posterior distribution [_{U}(

One can apply the MAP-decoding based calculation of

Note that even though the MAP decoder is optimal, it does not follow that

Given that the optimal MAP decoding does not necessarily reach the exact mutual information, it is reasonable to ask how large the gap is between these two quantities. For discrete inputs, classic work in information theory proved a number of upper bounds on this gap when the channel is known [_{UB}(

Our self-contained derivation [

The first model-free decoding approach we consider is based on classifiers called Support Vector Machines (SVMs). To begin we consider two possible inputs, _{ω} by means of a helper function _{ω}(_{ω}(^{(1)} if sign _{ω}(_{ω}(^{(2)} otherwise. Here,
_{t} is the number of samples in _{i} = −1 whenever the input corresponding to the _{i}, is ^{(1)}, i.e., _{i} = ^{(1)}; similarly _{i} = +1 whenever the corresponding input is ^{(2)}, i.e., _{i} = ^{(2)}.

To prevent overfitting and set the regularization parameter _{t} (here ~ 70% of the total, _{SVM} [

When we apply SVM decoding, we are still free to choose the kernel function. Here, we focus on two possibilities:

the linear kernel, K(x, x′) = x^{T}x′, and the radial basis function (RBF) kernel, K(x, x′) = exp(−γ‖x − x′‖^{2}).

For multiclass classification we use a decision-tree SVM classification method [
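The SVM decoding step can be sketched with scikit-learn (our choice of implementation, which the text does not specify): the regularization parameter C is selected on a held-out validation split of the training set, as described above, and the decoder is then evaluated on unseen test trajectories.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def svm_decode(trajectories, labels, kernel="linear",
               C_grid=(0.01, 0.1, 1.0, 10.0)):
    """Train an SVM decoder on discretized trajectories (rows = samples) and
    return decoded labels for a held-out test set, plus the true test labels.
    The regularization parameter C is chosen on a validation split."""
    # ~70% of the data used for training, as in the text
    X_tr, X_te, y_tr, y_te = train_test_split(
        trajectories, labels, test_size=0.3, random_state=0, stratify=labels)
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_tr, y_tr, test_size=0.3, random_state=0, stratify=y_tr)
    best_C = max(C_grid, key=lambda C: SVC(kernel=kernel, C=C)
                 .fit(X_fit, y_fit).score(X_val, y_val))
    clf = SVC(kernel=kernel, C=best_C).fit(X_tr, y_tr)
    return clf.predict(X_te), y_te
```

The decoded and true test labels can then be tallied into the confusion matrix that enters the information lower bound.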

In this model-free estimation, we revisit the assumption that the (discretely sampled) output trajectories

This method can be used with different parametric multivariate probability density functions replacing the multivariate Gaussian in
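A minimal numpy sketch of the Gaussian decoder: class-conditional multivariate Gaussians are fit to the training trajectories (the diagonal regularization strength `eps` is a hypothetical parameter name), and each test trajectory is decoded as the class with the highest log-likelihood.

```python
import numpy as np

def gaussian_decode(train_X, train_u, test_X, eps=1e-3):
    """Fit a multivariate Gaussian to the trajectories of each input class,
    with diagonal regularization eps added to the covariance, then decode
    each test trajectory as the class maximizing the log-likelihood."""
    classes = np.unique(train_u)
    log_liks = []
    for u in classes:
        Xu = train_X[train_u == u]
        mu = Xu.mean(axis=0)
        cov = np.cov(Xu, rowvar=False) + eps * np.eye(train_X.shape[1])
        _, logdet = np.linalg.slogdet(cov)
        prec = np.linalg.inv(cov)
        d = test_X - mu
        # log-likelihood up to a class-independent constant
        log_liks.append(-0.5 * (np.einsum('ij,jk,ik->i', d, prec, d) + logdet))
    return classes[np.argmax(np.array(log_liks), axis=0)]
```

Because only second-order statistics are fit, this decoder is blind by construction to information carried exclusively in higher-order structure of the responses.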

Artificial neural networks, first introduced by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts in 1943 [^{T} ^{T} _{0}). Using a single LTU amounts to training a binary linear classifier by learning the weights

For illustrative purposes we choose for our decoding function _{ω}(

For training, we used He-initialization, which initializes the weights with random numbers drawn from a normal distribution with zero mean and standard deviation √(2/n_{in}), where n_{in} is the number of inputs to units in a particular layer [
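He-initialization of one weight matrix can be sketched in a few lines (the function name `he_init` is ours):

```python
import numpy as np

def he_init(n_in, n_out, rng=None):
    """He initialization: weights drawn from a zero-mean normal distribution
    with standard deviation sqrt(2 / n_in), suited to ReLU-type units."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
```

Scaling the variance with the fan-in keeps the magnitude of activations roughly constant across layers at the start of training.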

We start by considering three simple chemical reaction networks for which we can obtain exact information values using the model-based approach outlined in Section

The three examples are all instances of a simple molecular birth-death process, where molecules of ^{(1)} and ^{(2)}, with equal probability, _{U}(^{(1)}) = _{U}(^{(2)}) = 0.5.

^{(1)}) = 0.1, ^{(2)}) = 0.07. Here, the steady state is given by a Poisson distribution with mean number of molecules 〈^{−1}. These dynamics stylize a class of frequently observed biochemical responses where the steady-state mean expression level encodes the relevant input value. Even if the stochastic trajectories for the two possible inputs are noisy as shown in

^{(1)}, ^{(2)}, ^{−4}. In the early period, this network approaches input-dependent steady state with means whose differences are larger than in Example 1, but the difference decays away for

^{(1)}) = 0.1, ^{(2)}) = 0.05, ^{(1)}) = 0.01, ^{(2)}) = 0.005, and are chosen so that the mean 〈^{1} than ^{2}. While this case is not frequently observed in biological systems, it represents a scenario where, by construction, no information about the input is present at the level of single concentration values and having access to the trajectories is essential. Because there is no difference in the mean response, we expect linear decoding methods to provide zero bits of information about the input. This case is also interesting because of the recent focus on pulsatile stationary-state dynamics in biochemical networks [

Three example birth-death processes, specified by the reactions in the top row for each of the two possible inputs (^{(1)} in blue, ^{(2)} in red), stylize simple behaviors of biochemical signaling networks.

Before proceeding, we note that our examples are not intended to be realistic models of intracellular biochemical networks, but are chosen here for their simplicity and analytical tractability, in order to benchmark model-free estimators against a known “gold standard”. In particular, while our examples include intrinsic noise due to the stochasticity of biochemical reactions at low concentrations, they do not include extrinsic noise or cell-to-cell variability, which, in some systems, is known to contribute substantially or even dominantly to the total variability in the response [

Armed with the full stochastic model for the three example reaction networks, we can compute the mutual information,

Exact Monte Carlo approximation for the information,

One can similarly compute the Bayes-optimal or MAP decoding bound using _{UB}, is not tight in this case, it nevertheless provides a check on how far optimal decoding could be from the true information, a question that has repeatedly worried the neuroscience community facing similar problems [

Examining the information increase specific to each example, we see that 1 bit is reached more quickly in Example 2 than in Example 1, because the difference in reaction rate parameters between the two inputs is larger in Example 2 than in Example 1 in the period

While it is interesting to contemplate whether biological systems themselves could compute with or act on singular, precisely-timed reaction events and thus make optimal use of the resulting channel capacity (mimicking the debate between spike timing code and spike rate code in neuroscience), our primary focus here is to estimate information flows from experimental data. Typically, experiments record the state of the system—e.g., concentration of signaling molecules—at discretely sampled time points. To explore the effects of time discretization, we first fix the observation length for our trajectories to

_{exact}(_{UB} (light solid gray) are plotted as a function of

_{exact}(_{UB}, to the theoretical limits from

After establishing our model-based “gold standard” for decoding-based estimators acting on trajectories represented in discretized time,

Performance of various model-free decoding estimators (colored lines) for Examples 1 _{MAP} (black line), as a function of input trajectory dimension, _{SVM(lin)} (orange); radial basis functions SVM, _{SVM(rbf)} (blue); the Gaussian decoder with diagonal regularization (see _{GD} (yellow); multi-layer perceptron neural network, _{NN} (green). Dashed vertical orange line marks the

_{MAP}, especially for the relevant regime

^{4}. We hypothesized that the failure of the Gaussian decoder on Example 2 is due to the difficulty of the Gaussian approximation to capture the period

_{NN} ≈ 0.65 bits at ^{5}. Given their expressive power, neural network decoders should be viewed as the opposite benchmark to the linear decoders: they have the ability to pick up complex statistical structures but only with a sufficient number of samples. Indeed, as we will see subsequently for applications to real data, neural networks can match and exceed the performance of SVMs. We emphasize that we used a neural network with a fixed architecture for all three examples on purpose, to make results comparable across examples; the performance can likely be improved by optimizing the architecture separately for each estimation problem. We did examine the issue of network architecture in greater detail in

We next asked whether our conclusions hold also when the space of possible inputs is expanded beyond binary, assuming that _{U}(

^{3} sample trajectories per condition, solid lines using ^{4} samples per condition; in both cases, we show an average over 20 independent replicates, error bars are suppressed for readability.

Our expectation is that with increasing

There exist many algorithms for estimating information directly, without making use of the decoding lower bound. The best known estimator for continuous signals is perhaps the k-nearest-neighbor (knn) estimator [

We therefore decided to focus on the comparison of decoding estimators with knn, which has been used previously on data from biochemical signaling networks [

Information estimates for decoding-based (color bars) and knn (gray bar) algorithms (here we set _{exact}(^{4} trajectory samples per input. The performance of knn can be substantially improved by adding a small amount of Gaussian noise to the trajectory samples; its resulting performance as a function of

To illustrate the use of our estimators in a realistic context, we analyzed data from two previously published papers. The first paper focused on the representation of environmental stress in the nuclear localization dynamics of several transcription factors (here we focus on data for Msn2, Dot6, and Sfp1) in budding yeast [^{2+}) [

Data replotted from Ref [^{2+} (bottom row). ^{2+}, respectively) at ^{2+} (right half). Data for ERK: ^{2+}:

Consistent with the published report [

It is interesting to look at the stationary responses in yeast which have not previously been analyzed in detail. First, low estimates provided by linear SVM for Msn2 and Dot6 imply that information in the stationary regime, if present, cannot be extracted by the linear classifier. Second, the Gaussian decoder also performs poorly in the stationary regime, potentially indicating that the relevant features are encoded in higher-than-pairwise order statistics of the response (e.g., pulses could be “sparse” features as in sparse coding [

Random pulses that encode stationary environmental signals have been observed for at least 10 transcription factors in yeast [

A different picture emerges from the mammalian signaling network data shown in ^{2+} data (perhaps due to low signal-to-noise ratio). It also shows counter-intuitive non-monotonic behavior with trajectory duration

Increasing availability of single-cell time-resolved data should allow us to address open questions regarding the amount of information about the external world that is available in the time-varying concentrations, activation or localization patterns, and modification state of various biochemical molecules. Do full response trajectories provide more information than single temporal snapshots, as early studies suggest? Is this information gain purely due to noise averaging enabled by observing multiple snapshots, or—more interestingly—due to the ability of these intrinsically high-dimensional signals to provide a richer representation of the cellular environment? Can we isolate biologically relevant features of the response trajectories, e.g., amplitude, frequency, pulse shape, relative phase or timing, without

Here, we made methodological steps towards answering these questions by focusing on two related problems: first, if we are given a full stochastic description of a biochemical reaction network, under what conditions can we theoretically compute information transmission through this network and various related bounds; second, if we are given real data with no description of the network, what are tractable schemes to estimate the information transmission. We show that when the complete state of the reaction network is observed and the inputs are discrete sets of reaction rates, there exist tractable Monte Carlo approximation schemes for the information transmission. These exact results that we compute for three simple biological network examples then serve to benchmark a family of decoding-based model-free estimators and compare their performance to the commonly-used knn estimator. We show that decoding-based estimators can closely approach the optimal decoder performance and in many cases perform better than knn, especially with typical problem dimensions (10^{2} − 10^{3}). This is especially true when we ask about the combinatorial representation of the environmental state in the time trajectories of several jointly observed chemical species, as in our previous work [

It is necessary to emphasize the flexibility of the decoding approach: decoding-based information estimation is based directly on the statistical problems of classification (for discrete input variable,

By construction, decoding-based estimators only provide a lower bound to the true information. This, however, could turn out to be a smaller problem in practice than it appears in theory, especially for biochemical reaction networks. First, our extension to the Feder-Merhav bound provides us with an estimate of how large the gap between the true information and the decoded estimation can be. The bound is not tight on our examples, and can only be applied when the optimal MAP decoder can be constructed [

We also mention a caveat when using decoding-based estimators that rely on classification or regression methods with large expressive power, such as neural networks. While it is possible to successfully guard against overfitting within the same dataset using cross-validation, scientific insights into biological function often require generalization beyond one particular dataset. Typically, we ask for generalization at least over independent experimental replicates, but sometimes even over similar (but not same) external conditions, strains, or experimental setups. This can present a serious issue if e.g., neural networks overfit to such systematic variations between replicates or conditions even when such variations are not biologically relevant. Regularization alone will not necessarily guard against this, unless the networks are actually trained over a subset of all data on which they will be tested. A pertinent recommendation here is to evaluate the difference in performance of expressive decoding-based estimators when trained over a subset or over all replicates, and to compare that to the generalization of less-expressive methods for which the sufficient statistics are known (e.g., linear or Gaussian decoders).

We conclude by emphasizing a simple yet important point. The decoding-based approach that we introduced here should also motivate us to look beyond methodological problems of significance and estimation, to truly biological problems of cellular decision making. Currently, data on biological regulatory processes is often analyzed by looking for “statistically significant differences” in the network response for, say, two possible network inputs. For example, one may report that the steady-state mean expression level of a certain gene is significantly larger in the stimulated vs unstimulated condition, with the statistical significance of the mean difference established through an appropriate statistical test that takes into account the number of collected population samples. While statistical significance is a necessary condition to validly report

(TIF)

_{MAP} decoding bound (black), Gaussian decoder estimate, _{GD(reg)}, with optimal diagonal regularization for each _{GD(noreg)} (brown). Without regularization, the estimate suffers an abrupt drop as ^{3}, here

(TIF)

Shown are _{NN} estimates on Example 3, analogous to

(TIF)

Gaussian approximation is evaluated for Example 3 in _{exact}(_{exact}.

(TIF)

Compared to knn results in ^{3}, used in ^{2}, used in ^{3} and ^{2}, without the addition of noise (C) or with the addition of noise (D).

(TIF)

When the samples are limited, here to

(TIF)

By randomly shuffling the binary labels assigned to different response trajectories, we break all response-input correlations leading to zero information. Here we test whether our estimators correctly report zero information within error bars given a finite number of samples, or are subject to positive information estimation bias. Decoding-based estimates (linear SVM, red; kernelized SVM, blue; Gaussian decoder, yellow) and knn (gray). First three sets of bars correspond to synthetic examples of

(TIF)

Shown are information estimates as a function of the total trajectory duration, ^{2+} (B). Plotting conventions, procedures, and data set sizes same as in

(TIF)

We thank Alejandro Granados, Mihal Hledik, Julian Pietsch, and Christoph Zechner for stimulating discussions.