Optimal prediction with resource constraints using the information bottleneck

doi:10.1371/journal.pcbi.1008743

Fig 1.

A schematic representation of our predictive information bottleneck.

On the left-hand side, we have coordinates X_t evolving in time, subject to noise, to give X_t+Δt. We construct a representation that compresses X_t (minimizing the mutual information between the representation and X_t) while retaining as much information as possible about X_t+Δt (maximizing the mutual information between the representation and X_t+Δt), with the weighting of prediction relative to compression set by β.
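The trade-off described in this caption is commonly written as a single objective over the encoding distribution. The form below is a sketch of that standard formulation; the symbol \tilde{X} for the representation and the sign convention are notation assumed here for illustration, while β is the weighting named in the caption.

```latex
\min_{p(\tilde{x}\,\mid\,x_t)} \; \mathcal{L}
  \;=\; I(X_t;\tilde{X}) \;-\; \beta\, I(\tilde{X};X_{t+\Delta t})
```

In this convention, small β favors compression (little information kept about X_t), and β → ∞ favors prediction regardless of how much is stored.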


Fig 2.

Schematic of the stochastically driven damped harmonic oscillator (SDDHO).

(a) The SDDHO consists of a mass attached to a spring, undergoing viscous damping and experiencing Gaussian thermal noise. There are two parameters to be explored in this model: ζ and Δt. (b) , Δt = 1. Here, we show an example distribution of the history (yellow, left) and its time evolution (purple, right). We take 5000 samples from the distribution at random and let these points evolve in time according to the SDDHO equation of motion. We visualize the evolution of the distribution of points in time via an ellipse representing the 1-σ confidence region of the rescaled position and velocity. (c) We illustrate the limiting case of the information bottleneck method when β → ∞. Representations of the past, and how they constrain an estimate of the future position and velocity of the object, can be compared to the prior by examining the relative size and shape of their respective ellipses. The blue circle represents the 1-σ confidence region of the prior. In yellow, we plot the inferred 1-σ confidence region associated with the estimate of the past, X_t, given by the encoding distribution when β → ∞. In this limit, the distribution is reduced to a single point. In purple, we plot the 1-σ confidence region of X_t+Δt given our knowledge of X_t. Precise knowledge of the past coordinates reduces our uncertainty about the future position and velocity (as compared to the prior), as depicted by the smaller area of the purple ellipse.
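The sampling procedure in panel (b) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: it assumes a dimensionless equation of motion with unit natural frequency, dx = v dt and dv = (−x − 2ζv) dt + noise·dW, integrated with a simple Euler-Maruyama scheme. The noise amplitude, the starting distribution, and the function names are all hypothetical choices.

```python
import numpy as np

def evolve_sddho(x0, v0, zeta, total_time, noise=1.0, step=1e-3, rng=None):
    """Euler-Maruyama integration of an assumed dimensionless SDDHO:
    dx = v dt,  dv = (-x - 2*zeta*v) dt + noise * dW."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    v = np.array(v0, dtype=float)
    n_steps = int(round(total_time / step))
    for _ in range(n_steps):
        kick = noise * np.sqrt(step) * rng.standard_normal(x.shape)
        # Both updates use the values from the previous step.
        x, v = x + v * step, v + (-x - 2.0 * zeta * v) * step + kick
    return x, v

def one_sigma_ellipse(x, v):
    """Center, semi-axis lengths, and axis directions of the 1-sigma ellipse."""
    pts = np.stack([x, v])
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts))
    return pts.mean(axis=1), np.sqrt(eigvals), eigvecs

rng = np.random.default_rng(0)
# Hypothetical "history" distribution: a narrow Gaussian around (x, v) = (1, 0).
x0 = 1.0 + 0.1 * rng.standard_normal(5000)
v0 = 0.1 * rng.standard_normal(5000)

# Let the 5000 samples evolve for Delta t = 1 in the underdamped regime.
x1, v1 = evolve_sddho(x0, v0, zeta=0.5, total_time=1.0, rng=rng)

center0, axes0, _ = one_sigma_ellipse(x0, v0)
center1, axes1, _ = one_sigma_ellipse(x1, v1)
print("ellipse area before:", np.pi * axes0.prod())
print("ellipse area after: ", np.pi * axes1.prod())
```

Under these assumptions the ellipse grows as noise accumulates, which is the spreading from the yellow to the purple region that the caption describes.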


Fig 3.

We consider the task of predicting the path of an SDDHO with and Δt = 1.

(a) (left) We encode the history of the stimulus, X_t, with a representation generated by the information bottleneck that can store 1 bit of information. Knowledge of the coordinates in the compressed representation space enables us to reduce our uncertainty about the bar’s position and velocity, with a confidence interval given by the yellow ellipse. This particular choice of encoding scheme enables us to predict the future, X_t+Δt, with a confidence interval given by the purple ellipse. The information bottleneck guarantees that this uncertainty in the future prediction is minimal for a given level of encoding. (right) The uncertainty in the prediction of the future can be reduced by reducing the overall level of uncertainty in the encoding of the history, as demonstrated by increasing the amount of information the representation can store about X_t. However, the uncertainty in the future prediction cannot be reduced below the variance of the propagator function. (b) We show how the information about X_t+Δt scales with the information about X_t, highlighting the points represented in panel (a).
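For jointly Gaussian past and future, the scaling shown in panel (b) has a known closed form, the Gaussian information bottleneck. The sketch below evaluates that closed form for a generic linear-Gaussian model, future = A·past + noise. It assumes Gaussianity; the matrices `A` and `Q`, the unit past covariance, and the function name are hypothetical stand-ins, not the SDDHO parameters used in the figure.

```python
import numpy as np

def gaussian_ib_curve(cov_past, cov_future, cov_cross, betas):
    """Information (in bits) kept about the past and about the future along the
    Gaussian IB curve. cov_cross[i, j] = Cov(past_i, future_j)."""
    # Covariance of the past conditioned on the future.
    cond = cov_past - cov_cross @ np.linalg.solve(cov_future, cov_cross.T)
    lam = np.linalg.eigvals(cond @ np.linalg.inv(cov_past)).real
    lam = np.clip(lam, 1e-12, 1.0 - 1e-12)
    i_past, i_future = [], []
    for beta in betas:
        # A mode contributes only once beta exceeds its critical value 1/(1 - lam).
        active = lam[beta * (1.0 - lam) > 1.0]
        ip = 0.5 * np.sum(np.log2((beta - 1.0) * (1.0 - active) / active))
        i_f = ip - 0.5 * np.sum(np.log2(beta * (1.0 - active)))
        i_past.append(ip)
        i_future.append(i_f)
    return np.array(i_past), np.array(i_future), lam

# Hypothetical linear-Gaussian propagator: future = A @ past + noise with covariance Q.
A = np.array([[0.8, 0.3], [-0.3, 0.5]])
Q = 0.1 * np.eye(2)
cov_past = np.eye(2)
cov_future = A @ cov_past @ A.T + Q
cov_cross = cov_past @ A.T

betas = np.array([0.5, 1.0, 2.0, 10.0, 1e6])
i_past, i_future, lam = gaussian_ib_curve(cov_past, cov_future, cov_cross, betas)

# Total information between past and future: the asymptote of the curve.
i_total = -0.5 * np.sum(np.log2(lam))
for b, ip, i_f in zip(betas, i_past, i_future):
    print(f"beta={b:g}: I_past={ip:.3f} bits, I_future={i_f:.3f} bits")
```

The predictive information never exceeds the stored information and saturates at `i_total` as β grows, mirroring the caption's statement that prediction uncertainty cannot be reduced below the variance of the propagator.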


Fig 4.

Possible behaviors of the SDDHO for a variety of timescales, with the information retained about X_t fixed at 5 bits.

For an overdamped SDDHO (panels a-c), the optimal representation continues to encode mostly position information, as velocity is hard to predict. For the underdamped case (panels g-i), as the timescale of prediction increases, the optimal representation changes from being mostly position information to being a mix of position and velocity information. Optimal representations for critically damped input motion are shown in panels d-f. Comparatively, overdamped stimuli do not require precise velocity measurements, even at long timescales. Optimal predictive representations of overdamped input dynamics have higher amounts of predictive information for longer timescales, when compared to underdamped and critically damped cases.


Fig 5.

Example of a suboptimal compression.

An optimally predictive, compressed representation in panel (a), compared to a suboptimal representation in panel (b), for a prediction at Δt = 1 in the future, within the underdamped regime (ζ = 1/2). We fix the mutual information between the representations and X_t ( bits), but find that, as expected, the suboptimal representation contains significantly less information about the future.


Fig 6.

Representations learned on underdamped systems can be transferred to other types of motion, while representations learned on overdamped systems cannot be easily transferred.

(a) Here, we consider the information bottleneck bound curve (black) for a stimulus with underlying parameters (ζ, Δt). For some particular level of compression, we obtain a mapping that extracts some predictive information about a stimulus with parameters (ζ, Δt). Keeping that mapping fixed, we determine the amount of predictive information it yields for dynamics with new parameters (ζ′, Δt′). (b) One-dimensional slices of the transferred predictive information in the (ζ′, Δt′) plane: versus ζ′ for Δt′ = 1 (top), and versus Δt′ for ζ′ = 1 (bottom). Parameters are set to (ζ = 1, Δt = 1), . (c) Two-dimensional map of the transferred predictive information versus (ζ′, Δt′) (same parameters as b). (d) Overall transferability of the mapping. The heatmap of (c) is integrated over ζ′ and Δt′ and normalized by the integral of . We see that mappings learned from underdamped systems at late times yield high levels of predictive information for a wide range of parameters, while mappings learned from overdamped systems are not generally useful.


Fig 7.

The ability of the information bottleneck method to predict history-dependent stimuli.

(a) The prediction problem, using an extended history and an extended future. This problem is largely similar to the one set up for the SDDHO, but the past and the future are larger composites of observations within a window of time: (t − t₀ : t), expressed as X_past, for the past, and (t + Δt : t + Δt + t₀), expressed as X_future, for the future. (b) Predictive information with lag Δt. (c) The maximum available predictive information saturates as a function of the length of history used, t₀.


Fig 8.

The information bottleneck solution for a Wright-Fisher process.

(a) The Wright-Fisher model of evolution can be visualized as a population of N parents giving rise to a population of N offspring. Genotypes of the offspring are selected as a function of the genotypes of the parents’ generation, subject to mutation rates, μ, and selective pressures, s. (b) Information bottleneck schematic with a discrete (rather than continuous) representation variable. (c) Predictive information as a function of compression level. Predictive information increases with the cardinality, m, of the representation variable. The amount of predictive information is limited by log(m) (vertical dashed lines) for small m, and by the mutual information between allele frequencies at time t + Δt and time t, I(X_t+Δt;X_t) (horizontal dashed line), for large m. Bifurcations occur in the amount of predictive information. When little information about X_t is retained, the encoding strategies for different m are degenerate; the degeneracy is lifted as the retained information increases, with large-m schemes accessing higher ranges. Parameters: N = 100, Nμ = 0.2, Ns = 0.001, Δt = 1. (d-i) We explore information bottleneck solutions to Wright-Fisher dynamics under the condition that the cardinality, m, of the representation variable is 2, and take β to be large enough that , β ≈ 4. Parameters: N = 100, Ns = 0.001, Δt = 1, and Nμ = 0.2, Nμ = 2, and Nμ = 40 (from left to right). (d-f) In blue, we plot the steady-state distribution. In yellow and red, we show the inferred historical distribution of alleles based on the observed value of the representation variable. Note that each distribution corresponds to a roughly non-overlapping portion of allele frequency space. (g-i) Predicted distribution of alleles based on the value of the representation variable. We observe that as the mutation rate increases, the timescale of relaxation to steady state decreases, so historical information is less useful and the predictions become more degenerate with the steady-state distribution.
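The two limits named in panel (c), log(m) and I(X_t+Δt;X_t), can be computed for a simple Wright-Fisher chain. The sketch below is illustrative only. It assumes a single locus with two alleles, a deterministic selection step p → p(1 + s)/(1 + sp), symmetric mutation at rate μ > 0, and binomial resampling of N copies each generation. The caption does not specify the exact update rule, so these choices and all function names are hypothetical.

```python
import math
import numpy as np

def wright_fisher_matrix(n, mu, s):
    """P[i, j] = Prob(j copies in the next generation | i copies now).
    Assumes mu > 0 so that the sampling probability stays strictly inside (0, 1)."""
    counts = np.arange(n + 1)
    p = counts / n
    p_sel = p * (1.0 + s) / (1.0 + s * p)              # assumed selection step
    p_mut = p_sel * (1.0 - mu) + (1.0 - p_sel) * mu    # assumed symmetric mutation
    log_binom = np.array([
        math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
        for j in counts
    ])
    log_p = (log_binom[None, :]
             + counts[None, :] * np.log(p_mut[:, None])
             + (n - counts[None, :]) * np.log1p(-p_mut[:, None]))
    return np.exp(log_p)

def stationary_distribution(P, n_iter=20000):
    """Steady-state distribution by repeated application of the transition matrix."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):
        pi = pi @ P
    return pi / pi.sum()

def mutual_information_bits(pi, P):
    """I(X_t; X_t+1) in bits when X_t is drawn from pi."""
    joint = pi[:, None] * P
    marg_next = joint.sum(axis=0)
    mask = P > 0
    ratio = (P / marg_next[None, :])[mask]
    return float(np.sum(joint[mask] * np.log2(ratio)))

# Parameters taken from the caption: N = 100, N*mu = 0.2, N*s = 0.001, Delta t = 1.
N = 100
P = wright_fisher_matrix(N, mu=0.2 / N, s=0.001 / N)
pi = stationary_distribution(P)
i_full = mutual_information_bits(pi, P)

for m in (2, 3, 200):
    bound = min(math.log2(m), i_full)
    print(f"m={m}: predictive information is at most {bound:.3f} bits")
```

Under these assumptions, a representation with m states can carry no more than the smaller of log2(m) and `i_full` bits about the next generation, which is the pair of dashed limits described in the caption.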


Fig 9.

Transferability of prediction schemes in Wright-Fisher dynamics.

We transfer a mapping trained on one set of parameters and apply it to another. We consider transfers between two choices of mutability, Nμ₁ = 0.2 (low) and Nμ₂ = 20 (high), with N = 100, Ns = 0.001, Δt = 1. The dotted line is the steady-state allele frequency distribution, the solid lines are the transferred representations, and the dashed lines are the optimal solutions. The top panels correspond to the distributions of X_t and the bottom panels to the distributions of X_t+Δt. (a) Transfer from high to low mutability. Optimal information values: and ; transferred information values: and . Representations learned on high mutation rates are not predictive in the low mutation regime. (b) Transfer from low to high mutability. Optimal information values: and ; transferred information values: and . Transfer in this direction yields good predictive information.


Fig 10.

Amount of predictive information in the Wright-Fisher dynamics as a function of model parameters.

(a-c) Value of the asymptote of the information bottleneck curve, I(X_t;X_t+Δt), with: (a) N = 100, Ns = 0.001, Δt = 1, as a function of μ; (b) N = 100, Nμ = 0.2, Ns = 0.001, as a function of Δt; and (c) N = 100, Nμ = 0.2, Δt = 1, as a function of s.


Fig 11.

Encoding schemes with m > 2 representation variables.

The steady state is plotted as a dotted line, and the representation for each realization of the representation variable is plotted as a solid line. The representations that carry maximum predictive information are shown for (a) m = 2 at bit, and (b) m = 3 at bits. The optimal representations at large m tile space more finely and have higher predictive information. The optimal representations for m = 200 are shown at fixed (c) β = 1.01 (, ) and (d) β = 20 (, ). At low , many of the representations are redundant and do not confer more predictive information than the m = 2 scheme. A more explicit comparison is given in S3 Fig. At high , the degeneracy is lifted. All computations were done at N = 100, Nμ = 0.2, Ns = 0.001, Δt = 1.
